MiMo-Audio: Audio Language Models are Few-Shot Learners

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans can generalize to new audio tasks from only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance among open-source models on both speech intelligence and audio understanding benchmarks. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities and can generate highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio), and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

Preprint, 2025, arXiv, Open access

Core Team, Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xingchen Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu, Peidian Li, Qiying Wang, Sirui Deng, Weimin Xiong, Wenshan Huang, Wenyu Yang, Yilin Jiang, Yixin Yang, Yuanyuan Tian, Yue Ma, Yue Yu, Zihan Zhang, Zihao Yue, Bangjun Xiao, Bingquan Xia, Bofei Gao, Bowen Ye, Can Cai, Chang Liu, Chenhong He, Chunan Li, Dawei Zhu, Duo Zhang, Fengyuan Shi, Guoan Wang, Hailin Zhang, Hanglong Lv, Hanyu Li, Hao Tian, Heng Qu, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianguang Zuo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Linghao Zhang, Meng Chen, Nuo Chen, Peng Zhang, Qianli Chen, Qiantong Wang, Rang Li, Shaohui Liu, Shengfan Wang, Shicheng Li, Shihua Yu, Shijie Cao, Shimao Chen, Shuhao Gu, Weikun Wang, Wenhan Ma, Xiangwei Deng, Xing Yong, Xing Zhang, Xu Wang, Yifan Song, Yihao Zhao, Yingbo Zhao, Yizhao Gao, Yu Cheng, Yu Tu, Yudong Wang, Zhaojun Huang, Zhengju Tang, Zhenru Lin, Zhichao Song, Zhipeng Xu, Zhixian Zheng, Zihan Jiang

Topics: Computation and Language, eess.AS, Sound

Signal facts

What is known right now

Open access, 100 authors, 3 topics

Citations: 0, Reviews: 0, Saves: 0, Code: not linked, Dataset: not linked, Institutions: 0
