MiDashengLM – Xiaomi’s Open-Source, Efficient Large Model for Audio Understanding
What is MiDashengLM?
MiDashengLM is Xiaomi’s open-source, high-efficiency large model for audio understanding, currently released as MiDashengLM-7B. Built on the Xiaomi Dasheng audio encoder and the Qwen2.5-Omni-7B Thinker decoder, the model uses a unified audio captioning alignment strategy to achieve comprehensive understanding of speech, environmental sounds, and music. It combines strong performance with high inference efficiency: its first-token latency is only a quarter of that of leading models, and it supports large-scale parallel processing. All training data is fully open source and available for both academic and commercial use. The model is well suited to applications such as smart cockpits and smart homes, advancing multimodal interaction experiences.
Key Features of MiDashengLM
- Audio Captioning: Converts audio input, including speech, environmental sounds, and music, into natural language descriptions, helping users quickly grasp audio content (see the usage sketch after this list).
- Audio Classification: Identifies the category of an audio clip (e.g., speech, environmental sound, music), useful in scenarios such as sound recognition and music classification.
- Automatic Speech Recognition (ASR): Transcribes spoken language into text, supports multiple languages, and is widely applied in voice assistants and in-vehicle systems.
- Audio Question Answering: Answers questions based on input audio content, applicable in smart cockpit scenarios such as queries about ambient sounds or music.
- Multimodal Interaction: Combines audio with other modalities (such as text or images) for integrated understanding, enhancing interactive experiences on smart devices.
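As a rough illustration of how these capabilities might be invoked, the sketch below loads the published checkpoint through Hugging Face transformers and asks for a caption of a local clip. The loading pattern (AutoProcessor/AutoModelForCausalLM with trust_remote_code), the keyword names passed to the processor, the 16 kHz input rate, and the prompt wording are all assumptions for illustration; the model card linked under Project Links documents the authoritative usage.

```python
# Hypothetical usage sketch; API details are assumptions, see the model card for
# the authoritative example.
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "mispeech/midashenglm-7b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

# Load a clip and resample to 16 kHz (assumed input rate for the audio encoder).
waveform, sr = torchaudio.load("example.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

# Ask for a caption; the prompt and the processor keyword names are illustrative.
inputs = processor(
    text="Describe this audio clip in one sentence.",
    audio=waveform.numpy(),
    sampling_rate=16_000,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```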
Technical Overview of MiDashengLM
Model Architecture
- Audio Encoder: Based on the Xiaomi Dasheng encoder, which transforms raw audio signals into high-dimensional representations. It excels at non-speech audio such as environmental sounds and music, capturing rich semantic information.
- Decoder: Built on the Qwen2.5-Omni-7B Thinker, an autoregressive decoder that converts the encoded features into natural language. The decoder supports tasks including audio captioning, audio question answering, and ASR. The data flow between the two components is sketched below.
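The following PyTorch-style sketch shows only the conceptual data flow: encode audio into frame embeddings, project them into the decoder’s embedding space, and decode autoregressively. The module stand-ins, dimensions, and vocabulary size are placeholders, not the actual MiDashengLM implementation.

```python
# Conceptual data flow only; all modules and dimensions here are placeholders.
import torch
import torch.nn as nn

class AudioToTextSketch(nn.Module):
    def __init__(self, audio_dim=1024, text_dim=2048, vocab_size=32_000):
        super().__init__()
        self.audio_encoder = nn.Identity()               # stand-in for the Dasheng encoder
        self.projector = nn.Linear(audio_dim, text_dim)  # maps audio frames into decoder space
        self.decoder = nn.Identity()                     # stand-in for the Qwen2.5-Omni-7B Thinker
        self.lm_head = nn.Linear(text_dim, vocab_size)

    def forward(self, audio_frames, text_embeds):
        # 1. Encode raw audio into a low-rate sequence of frame embeddings.
        audio_states = self.audio_encoder(audio_frames)
        # 2. Project the frames into the decoder's embedding space.
        audio_tokens = self.projector(audio_states)
        # 3. Prepend the audio tokens to the text prompt and decode next tokens.
        hidden = self.decoder(torch.cat([audio_tokens, text_embeds], dim=1))
        return self.lm_head(hidden)

# Example shapes: 30 s of audio at 5 frames/s gives 150 audio tokens ahead of a 16-token prompt.
sketch = AudioToTextSketch()
logits = sketch(torch.randn(1, 150, 1024), torch.randn(1, 16, 2048))
print(logits.shape)  # torch.Size([1, 166, 32000])
```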
Training Strategy
- Unified Audio Caption Alignment: Avoids the limitations of traditional ASR transcription targets by using a non-monotonic, global semantic mapping, which forces the model to learn deep semantic associations within an audio scene and yields a unified understanding of speech, environmental sound, and music (a toy illustration follows this list).
- Multi-Expert Annotation Pipeline: Training data is generated by a pipeline of expert models that produce fine-grained labels across speech, vocals, music, and acoustic environments; the final descriptions are synthesized with the DeepSeek-R1 reasoning model.
- Datasets: Trained on publicly available datasets spanning speech, environmental sound, and music, totaling over one million hours. The original labels are discarded during pretraining and replaced with rich textual descriptions, compelling the model to learn comprehensive audio semantics.
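To make the alignment idea concrete, the toy loss below treats a free-form caption of the whole clip, rather than a time-aligned transcript, as the training target: ordinary next-token cross-entropy over the caption tokens, conditioned on the audio tokens that precede them. This is a sketch of caption-as-target training under assumed tensor layouts, not the exact objective from the technical report.

```python
# Toy illustration only: caption tokens as the prediction target, audio tokens as context.
import torch
import torch.nn.functional as F

def caption_alignment_loss(logits, caption_ids, audio_prefix_len, pad_id=-100):
    """
    logits:           (batch, seq_len, vocab) decoder outputs over [audio tokens | caption tokens]
    caption_ids:      (batch, caption_len) tokenized whole-clip caption (the target)
    audio_prefix_len: number of audio tokens prepended to the sequence
    """
    # Position audio_prefix_len - 1 predicts the first caption token, and so on.
    start = audio_prefix_len - 1
    caption_logits = logits[:, start : start + caption_ids.size(1), :]
    return F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_ids.reshape(-1),
        ignore_index=pad_id,  # padded caption positions are excluded from the loss
    )

# Example shapes: 150 audio tokens (30 s at 5 Hz) followed by a 40-token caption.
logits = torch.randn(2, 190, 32_000)
caption = torch.randint(0, 32_000, (2, 40))
print(caption_alignment_loss(logits, caption, audio_prefix_len=150))
```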
Inference Optimization
- Efficient Inference: An optimized encoder design reduces the output frame rate from Qwen2.5-Omni's 25 Hz to 5 Hz, significantly lowering the computational load and improving inference speed (see the back-of-the-envelope calculation after this list).
- Massive Parallel Processing: Supports large batch sizes (up to 512). When processing 30 seconds of audio and generating 100 tokens on an 80 GB GPU, MiDashengLM achieves over 20× the throughput of Qwen2.5-Omni-7B.
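A quick calculation shows why the frame-rate reduction matters: for the 30-second clip cited above, 5 Hz yields 150 audio tokens instead of 750, a 5× shorter audio prefix for every decoding step (and a roughly quadratic saving in self-attention cost over that prefix).

```python
# Audio token budget for a 30-second clip at the two frame rates quoted above.
clip_seconds = 30
tokens_25hz = clip_seconds * 25   # Qwen2.5-Omni-style encoder: 750 audio tokens
tokens_5hz = clip_seconds * 5     # MiDashengLM encoder: 150 audio tokens
print(tokens_25hz, tokens_5hz, tokens_25hz / tokens_5hz)  # 750 150 5.0
```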
Project Links
- GitHub Repository: https://github.com/xiaomi-research/dasheng-lm
- HuggingFace Model Page: https://huggingface.co/mispeech/midashenglm-7b
- Technical Report: MiDashengLM Technical Paper (PDF)
- Online Demo: https://huggingface.co/spaces/mispeech/MiDashengLM-7B
Application Scenarios of MiDashengLM
- Smart Cockpits: Improve driving safety and in-car interaction through voice assistants and environmental sound recognition.
- Smart Homes: Enable voice control and environmental audio monitoring for seamless home automation.
- Voice Assistants: Offer multilingual speech recognition and intelligent dialogue capabilities to meet diverse user needs.
- Audio Content Creation and Tagging: Automatically generate audio captions and metadata to streamline creative workflows.
- Education and Learning: Assist language and music learners with pronunciation feedback and music-theory guidance.