Xiaomi-MiMo-Audio – Xiaomi’s open-source end-to-end speech large model

AI Tools · dongdong · updated 1 week ago

What is Xiaomi-MiMo-Audio?

Xiaomi-MiMo-Audio is Xiaomi’s first open-source native end-to-end speech large model. Built on an innovative pre-training architecture and trained on over one hundred million hours of data, it is the first model in the speech domain to achieve few-shot generalization through In-Context Learning (ICL), breaking the long-standing bottleneck of heavy reliance on large-scale labeled data. Across multiple standard benchmarks, Xiaomi-MiMo-Audio significantly outperforms open-source models of the same parameter scale, achieving best-in-class performance at the 7B scale. On the MMAU audio-understanding benchmark it surpasses Google’s Gemini-2.5-Flash, and on the Big Bench Audio S2T benchmark for complex audio reasoning it outperforms OpenAI’s GPT-4o-Audio-Preview.

Xiaomi has open-sourced the pre-trained model MiMo-Audio-7B-Base, the instruction-tuned model MiMo-Audio-7B-Instruct, and a 1.2B-parameter Tokenizer model, supporting both audio reconstruction and audio-to-text tasks.


Key Features of Xiaomi-MiMo-Audio

  • Few-shot generalization: For the first time in the speech domain, it achieves few-shot generalization through ICL, enabling fast adaptation to new tasks—marking the “GPT-3 moment” for speech.

  • Cross-modal alignment: Post-training further aligns the model’s intelligence, emotional expressiveness, and safety across text and audio modalities. Voice dialogue achieves highly human-like naturalness, emotional expression, and interaction adaptability.

  • Speech understanding and generation: Outperforms open-source models of the same parameter size across multiple standard benchmarks, delivering best-in-class 7B performance, even surpassing some closed-source models.

  • Complex audio reasoning: Excels in complex audio reasoning tasks on the Big Bench Audio S2T benchmark, demonstrating strong advanced reasoning capabilities.

  • Speech continuation: MiMo-Audio-7B-Base is the first open-source speech model with speech continuation capability.

  • Hybrid thinking: The first open-source model to introduce a “Thinking” mechanism into both speech understanding and generation, enabling hybrid reasoning.

  • Audio-to-text tasks: The open-sourced Tokenizer model supports A2T (audio-to-text) tasks and was trained on tens of millions of hours of speech data.
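The few-shot ICL behavior above can be sketched in the abstract: demonstration (audio, label) pairs are concatenated into a single token sequence, and the model is asked to continue it for a new query, with no weight updates. Everything below — the `<a…>` audio-token strings, the `<sep>`/`<eos>` markers, and the pair layout — is a hypothetical illustration, not MiMo-Audio’s actual prompt format.

```python
def build_icl_prompt(examples, query_audio_tokens):
    """Interleave (audio_tokens, label) demonstration pairs, then append
    the query. A model with in-context learning ability is expected to
    continue the sequence with the query's label, generalizing from the
    demonstrations alone."""
    prompt = []
    for audio_tokens, label in examples:
        prompt += audio_tokens + ["<sep>", label, "<eos>"]
    prompt += query_audio_tokens + ["<sep>"]  # model completes from here
    return prompt

# Two toy demonstrations of a speech-emotion task, then a query.
demos = [
    (["<a12>", "<a87>"], "happy"),
    (["<a44>", "<a19>"], "sad"),
]
prompt = build_icl_prompt(demos, ["<a71>", "<a05>"])
print(prompt)
```

The point of the layout is that the task is specified entirely by the demonstrations in the context window, which is what removes the need for task-specific labeled training data.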


Technical Principles of Xiaomi-MiMo-Audio

  • Innovative pre-training architecture: Pre-trained on over one hundred million hours of audio data, giving the model strong general speech processing capabilities.

  • Few-shot generalization with ICL: Rapidly adapts to new tasks with minimal examples.

  • Cross-modal alignment: Post-training further enhances intelligence, emotion, expressiveness, and safety, delivering highly human-like conversational performance.

  • Lossless compression pre-training: Treats pre-training as a lossless-compression objective over audio, improving generalization across tasks and surfacing emergent behavior in speech models.

  • Tokenizer model: A 1.2B-parameter Transformer-based Tokenizer trained from scratch, covering tens of millions of hours of speech data, supporting both audio reconstruction and audio-to-text tasks.

  • Lightweight post-training (SFT): Further boosts performance in speech understanding and generation.

  • Hybrid thinking mechanism: Incorporates “Thinking” into both understanding and generation, enhancing complex reasoning.
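To make the Tokenizer’s role concrete, here is a toy residual-quantization round-trip: a waveform is mapped to discrete token ids, and audio is reconstructed from codebook entries. The hand-written scalar codebooks and two-stage setup are simplifying assumptions — MiMo-Audio’s actual 1.2B Tokenizer is a learned Transformer — but the encode/decode contract is the same idea.

```python
# Illustrative two-stage residual quantizer (not MiMo-Audio's real one):
# the fine stage quantizes whatever error the coarse stage leaves behind.
CODEBOOKS = [
    [-0.8, -0.3, 0.0, 0.3, 0.8],    # coarse stage
    [-0.1, -0.05, 0.0, 0.05, 0.1],  # fine stage (residual)
]

def nearest(codebook, x):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def encode(samples):
    """Map each sample to a list of discrete token ids, one per stage."""
    tokens = []
    for x in samples:
        ids, residual = [], x
        for cb in CODEBOOKS:
            i = nearest(cb, residual)
            ids.append(i)
            residual -= cb[i]
        tokens.append(ids)
    return tokens

def decode(tokens):
    """Reconstruct audio by summing the chosen codebook entries."""
    return [sum(cb[i] for cb, i in zip(CODEBOOKS, ids)) for ids in tokens]

wave = [0.72, -0.31, 0.05]
tokens = encode(wave)
recon = decode(tokens)
err = max(abs(a - b) for a, b in zip(wave, recon))
print(tokens, err)
```

Stacking residual stages is what lets a small discrete vocabulary approximate audio closely enough for both reconstruction and downstream audio-to-text use.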

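The hybrid “Thinking” mechanism pairs a reasoning span with the model’s reply. As a minimal sketch — assuming a `<think>…</think>` tag convention, which is a hypothetical format rather than MiMo-Audio’s documented one — the reasoning can be separated from the text to be spoken:

```python
import re

# Matches an optional leading reasoning span; re.DOTALL lets the
# reasoning run across multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def split_thinking(model_output):
    """Return (reasoning, reply); reasoning is '' if the model skipped it."""
    m = THINK_RE.match(model_output)
    if m:
        return m.group(1).strip(), model_output[m.end():]
    return "", model_output

reasoning, reply = split_thinking(
    "<think>The speaker sounds upset; respond gently.</think>"
    "I'm sorry to hear that."
)
print(reasoning, "|", reply)
```

In a “hybrid” setup, only `reply` would be synthesized as speech, while the reasoning span stays internal — which is why the same mechanism can serve both understanding (reason about the audio) and generation (plan the response).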

Project Links


Application Scenarios of Xiaomi-MiMo-Audio

  • Voice interaction: Powering intelligent voice assistants with more natural and intelligent interactions, supporting multiple languages and dialects.

  • Speech generation: Producing high-quality voice content for audiobooks, announcements, navigation, and more.

  • Speech-to-text: Supporting A2T tasks for meeting transcription, voice input, and voice search.

  • Audio content creation: Helping creators generate audio scripts or voice content, improving productivity.

  • Emotional expression: Delivering rich emotional expressiveness in conversations, suitable for companion robots, customer service systems, and other emotion-driven applications.

  • Speech recognition and understanding: Excelling in audio understanding benchmarks, making it suitable for speech recognition and voice command control.
