MiMo-VL – Xiaomi’s Open-Source Multimodal Large Model

What is MiMo-VL?

MiMo-VL is an open-source multimodal large model developed by Xiaomi. It consists of a vision encoder, a cross-modal projection layer, and a language model. The vision encoder is based on Qwen2.5-ViT, and the language model is Xiaomi’s proprietary MiMo-7B. The model is pretrained with a multi-stage strategy on 2.4 trillion tokens of multimodal data, and its performance is further enhanced through hybrid online reinforcement learning (MORL). MiMo-VL achieves strong results in fundamental visual understanding, complex reasoning, and GUI interaction tasks: it scores 66.7% on MMMU-val, surpassing Gemma 3 27B, and 59.4% on OlympiadBench, outperforming some 72B models.
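
For orientation, a minimal inference sketch is shown below. It assumes the released checkpoints can be loaded through the Hugging Face transformers image-text-to-text interface; the repository ID, the auto classes, and the chat-template call are assumptions that should be verified against the official model card.

```python
# Minimal inference sketch. Assumptions: the released checkpoints expose a
# Hugging Face image-text-to-text interface, and the repo ID below is correct;
# check both against the official model card before use.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "XiaomiMiMo/MiMo-VL-7B-RL"  # assumed repository name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

image = Image.open("chart.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What trend does this chart show? Explain step by step."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```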


Key Features of MiMo-VL

  • Complex Image Reasoning and Q&A: Capable of performing reasoning and answering questions based on complex images, accurately interpreting visual content with coherent explanations and answers.

  • GUI Operation and Interaction: Supports GUI operations involving over 10 steps, understanding and executing complex graphical user interface instructions.

  • Video and Language Understanding: Understands video content and performs reasoning and Q&A in conjunction with language.

  • Long Document Parsing and Reasoning: Can handle long documents for in-depth reasoning and analysis.

  • User Experience Optimization: Uses a hybrid online reinforcement learning algorithm (MORL) to comprehensively improve model reasoning, perception, and user experience. A toy sketch of the reward-mixing idea follows this list.
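
As a rough illustration of the reward-mixing idea behind MORL, the toy sketch below blends a rule-based verifiable reward with a stand-in preference score and applies a REINFORCE-style on-policy update. The reward functions, weights, and update rule are illustrative assumptions, not the published recipe.

```python
# Toy sketch of mixing several reward signals in an on-policy update.
# All reward functions and weights below are illustrative assumptions.
import torch

def rule_based_reward(answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches the reference exactly."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def preference_reward(answer: str) -> float:
    """Placeholder for a learned reward model scoring helpfulness / user experience."""
    return min(len(answer) / 200.0, 1.0)

def mixed_reward(answer: str, reference: str, w_rule: float = 0.7, w_pref: float = 0.3) -> float:
    return w_rule * rule_based_reward(answer, reference) + w_pref * preference_reward(answer)

def reinforce_step(policy, optimizer, prompts, references, sample_fn):
    """One on-policy step: sample from the current policy, score with the mixed
    reward, and push up the log-probability of high-reward responses."""
    optimizer.zero_grad()
    losses = []
    for prompt, ref in zip(prompts, references):
        answer, log_prob = sample_fn(policy, prompt)  # log_prob: summed token log-probs (tensor)
        losses.append(-mixed_reward(answer, ref) * log_prob)
    loss = torch.stack(losses).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```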


Technical Architecture of MiMo-VL

  • Vision Encoder: Based on Qwen2.5-ViT, supports native resolution input to preserve more visual detail.

  • Cross-modal Projection Layer: Aligns vision and language features using an MLP structure.

  • Language Model: Utilizes Xiaomi’s proprietary MiMo-7B model, specifically optimized for complex reasoning. A schematic sketch of how these three components compose follows this list.

  • Multi-Stage Pretraining: Involves collecting, cleaning, and synthesizing high-quality multimodal pretraining data, including image-text pairs, video-text pairs, and GUI operation sequences—totaling 2.4 trillion tokens. The data proportions are adjusted across stages to strengthen long-range multimodal reasoning.
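
To make the layout concrete, the schematic PyTorch sketch below composes the three components described above: a ViT-style vision encoder produces patch features, an MLP projector maps them into the language model's embedding space, and the projected visual tokens are decoded together with the text embeddings. The dimensions, module choices, and fusion strategy are illustrative assumptions rather than the released configuration.

```python
# Schematic composition of vision encoder -> MLP projector -> language model.
# Dimensions and the simple "prepend visual tokens" fusion are assumptions.
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1280, text_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a Qwen2.5-ViT-style encoder
        self.projector = nn.Sequential(             # cross-modal projection layer (MLP)
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model        # e.g. a MiMo-7B-style decoder

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # Encode the image (ideally at native resolution) into patch features.
        vision_feats = self.vision_encoder(pixel_values)         # [B, N_patches, vision_dim]
        # Project patch features into the language model's embedding space.
        vision_tokens = self.projector(vision_feats)             # [B, N_patches, text_dim]
        # Feed visual tokens and text embeddings to the decoder as one sequence.
        fused = torch.cat([vision_tokens, text_embeds], dim=1)   # [B, N_patches + T, text_dim]
        return self.language_model(inputs_embeds=fused)
```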


Four-Stage Pretraining Strategy

  1. Projection Layer Warm-up: Uses image-text pair data, sequence length up to 8K.

  2. Vision-Language Alignment: Uses interleaved image-text data, sequence length 8K.

  3. Multimodal Pretraining: Incorporates OCR, video, GUI, and reasoning data, sequence length 8K.

  4. Long-context SFT: Uses high-resolution images, long documents, and extended reasoning chains, with sequence length up to 32K. The full schedule is summarized in the configuration sketch below.
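
The schedule can be condensed into the configuration sketch below. Stage order, data types, and sequence lengths follow the description above; the field names, the expansion of 8K/32K into token counts, and which modules are unfrozen per stage are assumptions for illustration.

```python
# Configuration sketch of the four-stage schedule (field names are assumptions).
PRETRAINING_STAGES = [
    {"stage": 1, "name": "projector_warmup",
     "data": ["image_text_pairs"],
     "unfrozen": ["projector"], "max_seq_len": 8192},
    {"stage": 2, "name": "vision_language_alignment",
     "data": ["interleaved_image_text"],
     "unfrozen": ["vision_encoder", "projector", "language_model"], "max_seq_len": 8192},
    {"stage": 3, "name": "multimodal_pretraining",
     "data": ["ocr", "video_text", "gui_sequences", "reasoning"],
     "unfrozen": ["vision_encoder", "projector", "language_model"], "max_seq_len": 8192},
    {"stage": 4, "name": "long_context_sft",
     "data": ["high_res_images", "long_documents", "long_reasoning_chains"],
     "unfrozen": ["vision_encoder", "projector", "language_model"], "max_seq_len": 32768},
]
```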


Project Repositories


Application Scenarios of MiMo-VL

  • Smart Customer Service: Performs complex image reasoning and Q&A tasks, offering users more intelligent and convenient support.

  • Smart Home: Understands household photos and videos and performs GUI grounding tasks, improving the efficiency and experience of human-computer interaction.

  • Smart Healthcare: Understands medical images and texts to assist doctors in diagnosis and treatment.

  • Education: Aids in solving math problems and learning programming by providing step-by-step solutions and code examples.

  • Research and Academia: Supports logical reasoning and algorithm development, helping researchers verify hypotheses and design experiments.
