Qwen2.5-Omni – Alibaba’s Open-Source End-to-End Multimodal Model


What is Qwen2.5-Omni?

Qwen2.5-Omni is the flagship end-to-end multimodal model in Alibaba's open-source Qwen series, with 7B parameters. It can perceive and process text, image, audio, and video inputs, and it supports streaming text generation together with natural speech synthesis, enabling real-time voice and video chat. Qwen2.5-Omni uses a Thinker-Talker architecture: the Thinker processes and understands the multimodal inputs, producing high-level representations and text, while the Talker converts those representations and text into fluent speech output. The model achieves state-of-the-art results on multimodal benchmarks such as OmniBench, outperforming models like Google's Gemini-1.5-Pro, and it also performs strongly on single-modal tasks such as speech recognition, translation, and audio understanding. Qwen2.5-Omni can be tried for free on Qwen Chat. The model is open-source, so developers and enterprises can download and use it for free, including commercially, and deploy it on edge devices such as smartphones.
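
For developers who want to try the open-source checkpoint locally, the sketch below shows a minimal loading flow. The class names (Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor) and the Qwen/Qwen2.5-Omni-7B model ID follow the Hugging Face model card at the time of writing; treat them as assumptions and check them against your installed transformers version.

```python
# Minimal loading sketch (assumes a transformers version that ships the
# Qwen2.5-Omni integration; class names follow the Hugging Face model card
# and may differ in your installed version).
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model_id = "Qwen/Qwen2.5-Omni-7B"

# The processor bundles the tokenizer plus the image/audio/video preprocessors.
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# Load the full Thinker-Talker model; device_map="auto" spreads it across GPUs.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```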


The main functions of Qwen2.5-Omni

  • Text Processing: Understands and processes various text inputs, including natural language conversations, commands, long texts, etc., supporting multiple languages.
  • Image Recognition: Supports the recognition and understanding of image content.
  • Audio Processing: Equipped with speech recognition capabilities, it can convert speech into text, understand voice commands, and generate natural and smooth speech output.
  • Video Understanding: Supports processing video inputs, synchronously analyzing the visual and audio information in videos, and enabling features such as video content understanding and video question answering.
  • Real-time Voice and Video Chat: Supports real-time processing of voice and video streams, enabling smooth voice and video chat (a usage sketch follows this list).
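
As referenced in the last item above, here is a sketch of how these capabilities are typically exercised through the Hugging Face interface. The conversation format, the process_mm_info helper from the qwen_omni_utils package, and a generate() call that returns both text IDs and a waveform follow the model card's example and may differ across versions; file paths are placeholders.

```python
# Usage sketch for multimodal chat with speech output. Names follow the
# Hugging Face model card and should be checked against your installed
# transformers / qwen_omni_utils versions.
import soundfile as sf
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # helper from the Qwen2.5-Omni GitHub repo

model_id = "Qwen/Qwen2.5-Omni-7B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

# A single turn mixing an image, a spoken question, and a text instruction.
# (File paths are placeholders; the model card also recommends a specific
# system prompt when speech output is wanted.)
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": "example.jpg"},
        {"type": "audio", "audio": "question.wav"},
        {"type": "text", "text": "Describe the image and answer the spoken question."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True,
).to(model.device)

# generate() is documented to return both text token IDs and a waveform.
text_ids, audio = model.generate(**inputs)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).float().cpu().numpy(), samplerate=24000)
```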

The Technical Principles of Qwen2.5-Omni

  • Thinker-Talker Architecture: Based on the Thinker-Talker architecture, the model is divided into two main components. The Thinker serves as the “brain” of the model, responsible for processing and understanding multimodal inputs such as text, audio, and video, generating high-level semantic representations and corresponding text outputs. The Talker acts as the “mouth” of the model, responsible for converting the high-level representations and text generated by the Thinker into fluent speech outputs.
  • Time-aligned Multimodal Positional Embedding (TMRoPE): To keep video frames synchronized with the accompanying audio, Qwen2.5-Omni introduces a new positional embedding, TMRoPE (Time-aligned Multimodal RoPE). It decomposes the rotary embedding into three components (time, height, width) and interleaves audio and video frames so the temporal order of the sequence is preserved (an illustrative position-ID sketch follows this list):
    ◦ Text: all three components share the same increasing ID, so TMRoPE reduces to ordinary one-dimensional RoPE.
    ◦ Audio: each 40 ms audio frame gets its own time ID, introducing absolute time positional encoding.
    ◦ Image: the time ID is constant for all visual tokens of an image, while the height and width IDs follow each token's position in the image grid.
    ◦ Video: the time IDs of audio and video frames are alternated so that the two streams stay temporally aligned.
  • Stream Processing and Real-time Response: Long multimodal input sequences are split into smaller blocks that are processed separately, reducing latency. A sliding-window mechanism limits the context range of the current token, further improving the efficiency of streaming generation. The audio and video encoders use block attention, processing data in chunks that each cover roughly 2 seconds of input (a block-attention mask sketch follows this list). For streaming speech generation, a Flow-Matching model and BigVGAN convert the generated audio tokens into waveforms block by block, supporting real-time speech output.
  • The three training stages of Qwen2.5-Omni (a staged-freezing sketch follows this list):
    ◦ Stage 1: Freeze the language model parameters and train only the vision and audio encoders on large amounts of audio-text and image-text paired data, to enhance the model's understanding of multimodal information.
    ◦ Stage 2: Unfreeze all parameters and train on a broader mix of image, video, audio, and text data, to further improve comprehensive multimodal understanding.
    ◦ Stage 3: Train on long-sequence data (up to 32k tokens) to strengthen the model's ability to understand complex, long sequences.
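
To make the TMRoPE ID scheme above concrete, here is a small illustrative sketch of how (time, height, width) position triples could be assigned per modality. It is a simplified reconstruction of the description in the list, not the actual Qwen2.5-Omni implementation, and the function names are hypothetical.

```python
# Illustrative TMRoPE-style position IDs: each token gets a (time, height, width)
# triple. Simplified reconstruction of the scheme described above, not the real
# Qwen2.5-Omni code; function names are hypothetical.
from typing import List, Tuple

Pos = Tuple[int, int, int]  # (time, height, width)

def text_positions(start: int, num_tokens: int) -> List[Pos]:
    # Text: all three components share one increasing ID, i.e. ordinary 1-D RoPE.
    return [(start + i, start + i, start + i) for i in range(num_tokens)]

def audio_positions(start: int, num_frames: int) -> List[Pos]:
    # Audio: one ID per 40 ms frame, so the time axis encodes absolute time.
    return [(start + i, start + i, start + i) for i in range(num_frames)]

def image_positions(start: int, grid_h: int, grid_w: int) -> List[Pos]:
    # Image: the time ID is constant; height/width IDs come from the token's
    # position in the visual token grid.
    return [(start, start + h, start + w)
            for h in range(grid_h) for w in range(grid_w)]

# For video, per-frame visual tokens reuse image_positions with a time ID that
# advances frame by frame, and audio/video chunks are interleaved so that tokens
# covering the same wall-clock time receive neighbouring time IDs.
```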
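
The block attention and sliding window described above can be pictured as a block-wise attention mask. The sketch below builds such a mask for illustration only; the chunk and window sizes are arbitrary example values, not the model's real configuration.

```python
# Illustrative block-wise attention mask for streaming encoders: tokens may attend
# within their own chunk (roughly 2 s of audio/video in the real model) and to a
# limited window of previous chunks. Sizes here are arbitrary examples.
import numpy as np

def block_streaming_mask(num_tokens: int, chunk: int, lookback_chunks: int) -> np.ndarray:
    """Return a boolean matrix that is True where attention is allowed."""
    mask = np.zeros((num_tokens, num_tokens), dtype=bool)
    for q in range(num_tokens):
        q_chunk = q // chunk
        first_chunk = max(0, q_chunk - lookback_chunks)
        start = first_chunk * chunk
        end = (q_chunk + 1) * chunk  # end of the query token's own chunk
        mask[q, start:min(end, num_tokens)] = True
    return mask

# Example: 12 tokens, chunks of 4 tokens, sliding window of 1 previous chunk.
print(block_streaming_mask(12, chunk=4, lookback_chunks=1).astype(int))
```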
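
In practice, the three stages mostly differ in which parameter groups are trainable and what data is used. The sketch below illustrates the freezing pattern in PyTorch style; module names such as vision_encoder and audio_encoder are invented for illustration and do not correspond to released code.

```python
# Hypothetical sketch of the three-stage freezing pattern described above.
# Module names (vision_encoder, audio_encoder) are invented for illustration.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: nn.Module, stage: int) -> None:
    if stage == 1:
        # Stage 1: freeze the language model, train only the modality encoders
        # on audio-text and image-text pairs.
        set_trainable(model, False)
        set_trainable(model.vision_encoder, True)
        set_trainable(model.audio_encoder, True)
    elif stage in (2, 3):
        # Stage 2: unfreeze everything for mixed multimodal training.
        # Stage 3 keeps the same trainable set but switches to 32k-token
        # long-sequence data.
        set_trainable(model, True)
```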

Project Address of Qwen2.5-Omni

  • GitHub repository: https://github.com/QwenLM/Qwen2.5-Omni
  • Hugging Face model: https://huggingface.co/Qwen/Qwen2.5-Omni-7B
  • Online demo: Qwen Chat (https://chat.qwen.ai)

The Performance of the Qwen2.5-Omni Model

  • Multimodal tasks: Achieves state-of-the-art performance on multimodal benchmarks such as OmniBench.
  • Single-modal tasks: Delivers excellent performance across multiple domains, including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness).

Application scenarios of Qwen2.5-Omni

  • Intelligent Customer Service: Provides users with real-time consultation and Q&A through voice and text interaction.
  • Virtual Assistant: Acts as a personal assistant that helps users with tasks such as schedule management, information lookup, and reminders.
  • Education Field: Supports online education with features such as voice explanations, interactive Q&A, and homework tutoring.
  • Entertainment Field: Provides voice interaction, character dubbing, and content recommendation in games, video, and other media, enhancing users' sense of participation and immersion for a richer entertainment experience.
  • Intelligent Office: Assists with office work, for example generating high-quality meeting minutes and notes from voice meetings to improve efficiency.