Xiaomi-MiMo-Audio – Xiaomi’s open-source end-to-end large speech model
What is Xiaomi-MiMo-Audio?
Xiaomi-MiMo-Audio is Xiaomi’s first open-source native end-to-end large speech model. Built on an innovative pre-training architecture and trained on billions of hours of data, it is the first model in the speech domain to achieve few-shot generalization through In-Context Learning (ICL), breaking the long-standing bottleneck of heavy reliance on large-scale labeled data. Across multiple standard benchmarks, Xiaomi-MiMo-Audio significantly outperforms open-source models of the same parameter scale, achieving best-in-class performance at the 7B scale. On the MMAU audio-understanding benchmark it surpasses Google’s Gemini-2.5-Flash, and on the Big Bench Audio S2T benchmark for complex audio reasoning it outperforms OpenAI’s GPT-4o-Audio-Preview.
Xiaomi has open-sourced the pre-trained model MiMo-Audio-7B-Base, the instruction-tuned model MiMo-Audio-7B-Instruct, and a 1.2B-parameter Tokenizer model that supports both audio reconstruction and audio-to-text tasks.
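For orientation, here is a minimal sketch of fetching the open-sourced checkpoints from Hugging Face. Only the repository IDs come from the Project Links section below; how the weights are then loaded is left to the official repo's own inference code, so treat this as a starting point rather than the supported workflow.

```python
# Minimal sketch: download the open-sourced MiMo-Audio checkpoints locally.
# The repo IDs are from the Project Links section below; how the weights are
# then loaded is up to the official repo's own inference scripts.
from huggingface_hub import snapshot_download

model_dir = snapshot_download("XiaomiMiMo/MiMo-Audio-7B-Instruct")    # instruction-tuned 7B model
tokenizer_dir = snapshot_download("XiaomiMiMo/MiMo-Audio-Tokenizer")  # 1.2B audio tokenizer

print("model at:", model_dir)
print("tokenizer at:", tokenizer_dir)
```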
Key Features of Xiaomi-MiMo-Audio
- Few-shot generalization: For the first time in the speech domain, it achieves few-shot generalization through ICL, enabling fast adaptation to new tasks and marking the “GPT-3 moment” for speech (see the prompt sketch after this list).
- Cross-modal alignment: Post-training strengthens cross-modal alignment in intelligence, emotional expressiveness, performance, and safety, so voice dialogues reach highly human-like naturalness, emotional expression, and interaction adaptability.
- Speech understanding and generation: Outperforms open-source models of the same parameter size across multiple standard benchmarks, delivering best-in-class 7B performance and even surpassing some closed-source models.
- Complex audio reasoning: Excels at complex audio reasoning on the Big Bench Audio S2T benchmark, demonstrating strong advanced reasoning capabilities.
- Speech continuation: MiMo-Audio-7B-Base is the first open-source speech model with speech continuation capability.
- Hybrid thinking: The first open-source model to introduce a “Thinking” mechanism into both speech understanding and generation, enabling hybrid reasoning.
- Audio-to-text tasks: The 1.2B Tokenizer model supports A2T (audio-to-text) tasks and is trained on more than ten million hours of speech data.
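To make the few-shot ICL claim concrete, the sketch below shows the general prompting pattern: a handful of (audio, text) demonstration pairs are interleaved into a single context, followed by the query audio, and the model infers the task from the demonstrations alone. `load_audio_tokens` is a hypothetical helper standing in for the model's real audio front end; only the prompt structure is the point.

```python
# Hypothetical sketch of few-shot in-context learning with an audio LM.
# load_audio_tokens() stands in for whatever front end the real MiMo-Audio
# code exposes; only the interleaved prompt pattern is the point here.
from typing import Callable, List, Tuple

def build_icl_prompt(
    examples: List[Tuple[str, str]],           # (audio_path, target_text) demonstrations
    query_audio_path: str,                     # clip the model should handle
    load_audio_tokens: Callable[[str], list],  # hypothetical audio-to-token helper
) -> list:
    """Interleave demonstrations, then append the query audio for completion."""
    parts: list = []
    for audio_path, target_text in examples:
        parts.extend(load_audio_tokens(audio_path))  # discrete audio tokens
        parts.append(target_text)                    # desired output for that clip
    parts.extend(load_audio_tokens(query_audio_path))  # model continues the pattern
    return parts

# Usage: three demonstrations teach an unseen task (say, emotion labeling);
# the model is expected to produce the label for query.wav by analogy.
# prompt = build_icl_prompt(
#     [("clip1.wav", "happy"), ("clip2.wav", "sad"), ("clip3.wav", "angry")],
#     "query.wav",
#     load_audio_tokens,
# )
```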
Technical Principles of Xiaomi-MiMo-Audio
- Innovative pre-training architecture: Pre-trained on billions of hours of data, giving the model stronger speech-processing capabilities.
- Few-shot generalization with ICL: Rapidly adapts to new tasks from only a handful of examples.
- Cross-modal alignment: Post-training further enhances intelligence, emotion, expressiveness, and safety, delivering highly human-like conversational performance.
- Lossless compression pre-training: Treats pre-training as lossless compression of audio, improving task generalization and eliciting emergent behavior in speech models.
- Tokenizer model: A 1.2B-parameter Transformer-based Tokenizer trained from scratch on more than ten million hours of speech data, supporting both audio reconstruction and audio-to-text tasks (see the round-trip sketch after this list).
- Lightweight post-training (SFT): Further boosts performance in speech understanding and generation.
- Hybrid thinking mechanism: Incorporates “Thinking” into both understanding and generation, enhancing complex reasoning.
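As an illustration of what “audio reconstruction plus audio-to-text” means in practice, here is a round-trip sketch: a waveform is encoded into the discrete tokens the language model consumes, and the same tokens can be decoded back into audio. The `MiMoAudioTokenizer` class and its `encode`/`decode` methods are placeholders, not the released model's actual interface.

```python
# Hypothetical interface sketch of the discrete-token round-trip an audio
# tokenizer enables; encode()/decode() are placeholders, not the real API.
import numpy as np

class MiMoAudioTokenizer:  # placeholder for the released 1.2B tokenizer
    def encode(self, waveform: np.ndarray) -> list[int]:
        """Compress a raw waveform into a sequence of discrete audio tokens."""
        raise NotImplementedError

    def decode(self, tokens: list[int]) -> np.ndarray:
        """Reconstruct a waveform from a sequence of discrete audio tokens."""
        raise NotImplementedError

# Round-trip: the language model only ever sees the token sequence in the
# middle, which lets text and audio share one next-token training stream.
# tokenizer = MiMoAudioTokenizer()
# tokens = tokenizer.encode(waveform)    # audio -> discrete tokens
# restored = tokenizer.decode(tokens)    # tokens -> audio (reconstruction)
```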
Project Links
- Official site: https://xiaomimimo.github.io/MiMo-Audio-Demo/
- Hugging Face models:
  - MiMo-Audio-7B-Base: https://huggingface.co/XiaomiMiMo/MiMo-Audio-7B-Base
  - MiMo-Audio-7B-Instruct: https://huggingface.co/XiaomiMiMo/MiMo-Audio-7B-Instruct
  - Tokenizer: https://huggingface.co/XiaomiMiMo/MiMo-Audio-Tokenizer
- Technical report: https://github.com/XiaomiMiMo/MiMo-Audio/blob/main/MiMo-Audio-Technical-Report.pdf
Application Scenarios of Xiaomi-MiMo-Audio
- Voice interaction: Powering intelligent voice assistants with more natural and intelligent interactions, supporting multiple languages and dialects.
- Speech generation: Producing high-quality voice content for audiobooks, announcements, navigation, and more.
- Speech-to-text: Supporting A2T tasks such as meeting transcription, voice input, and voice search.
- Audio content creation: Helping creators generate audio scripts or voice content, improving productivity.
- Emotional expression: Delivering rich emotional expressiveness in conversations, suitable for companion robots, customer service systems, and other emotion-driven applications.
- Speech recognition and understanding: Excelling on audio-understanding benchmarks, making it suitable for speech recognition and voice command control.