dots.vlm1 – Xiaohongshu Hi Lab’s First Open-Source Multimodal Large Model

What is dots.vlm1?

dots.vlm1 is the first open-source multimodal large model released by Xiaohongshu Hi Lab. It pairs NaViT, a 1.2-billion-parameter vision encoder trained entirely from scratch, with the DeepSeek V3 large language model (LLM), combining strong visual perception with textual reasoning. The model performs well on visual understanding and reasoning tasks, approaching the level of closed-source state-of-the-art (SOTA) models, while remaining competitive on pure-text tasks. NaViT natively supports dynamic resolution and adds pure visual supervision on top of textual supervision to strengthen perception. The training data incorporates a range of synthetic-data strategies covering diverse image types and their descriptions, which significantly improves data quality.
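
For readers who want a feel for how a model like this is typically invoked, here is a minimal inference sketch using Hugging Face transformers. It assumes the released checkpoint follows the common AutoProcessor / AutoModelForCausalLM pattern with remote code enabled; the repository id, chat format, and processor behavior shown here are placeholders and may differ from the official dots.vlm1 release.

```python
# Hypothetical inference sketch -- the repository id and message format are assumptions,
# not the confirmed dots.vlm1 API. Check the official model card for the real interface.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "rednote-hilab/dots.vlm1"  # placeholder id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, device_map="auto"
)

image = Image.open("chart.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Summarize the trend shown in this chart."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```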

Main Features of dots.vlm1

  • Powerful Visual Understanding: Accurately recognizes and understands image content, including complex charts, tables, documents, and graphics; supports dynamic resolution; and is suited to a wide range of visual tasks.

  • Efficient Text Generation and Reasoning: Built on the DeepSeek V3 LLM, it generates high-quality text and performs well on text reasoning tasks such as math and code.

  • Multimodal Data Processing: Supports interleaved image-text data processing, enabling integrated reasoning across visual and textual information, suitable for multimodal applications.

  • Flexible Adaptation and Extension: Connects the vision encoder and the language model through a lightweight MLP adapter, allowing flexible adaptation and extension for different tasks (see the sketch after this list).

  • Fully Open Source: Provides complete open-source code and model weights for developers to use in research and application development, advancing multimodal technology.
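
The "lightweight MLP adapter" mentioned above is, in open vision-language models, usually a small projection network that maps vision-encoder patch features into the LLM's embedding space. The sketch below illustrates that idea in PyTorch with assumed dimensions; the actual adapter's depth, hidden size, and any token merging used in dots.vlm1 are not specified here.

```python
# Illustrative sketch of a vision-to-language MLP adapter.
# All dimensions are assumptions, not the actual dots.vlm1 configuration.
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Projects vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1536, llm_dim: int = 7168, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns visual "tokens" the LLM can consume: (batch, num_patches, llm_dim)
        return self.proj(patch_features)

adapter = MLPAdapter()
visual_tokens = adapter(torch.randn(1, 256, 1536))
print(visual_tokens.shape)  # torch.Size([1, 256, 7168])
```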


Technical Principles of dots.vlm1

  • NaViT Vision Encoder: dots.vlm1 uses NaViT, a 1.2-billion-parameter vision encoder trained from scratch rather than fine-tuned from an existing model. It natively supports dynamic resolution, so it can process images of varying sizes without resizing them to a fixed shape (see the patch-packing sketch after this list). On top of textual supervision, it adds pure visual supervision to strengthen image perception.

  • Multimodal Data Training: The model is trained on diversified multimodal data, including ordinary images, complex charts, tables, documents, and graphics, paired with text descriptions such as Alt Text, Dense Captions, and Grounding annotations. Synthetic-data techniques and interleaved image-text data from web pages and PDFs are also introduced, with rewriting and cleaning steps that improve data quality and strengthen multimodal understanding.

  • Vision-Language Model Fusion: dots.vlm1 integrates the vision encoder with the DeepSeek V3 large language model through a lightweight MLP adapter, achieving effective fusion of visual and language information to support multimodal task processing.

  • Three-Stage Training Process: Training proceeds in three stages: vision encoder pretraining, VLM pretraining, and VLM post-training. Image resolution is increased gradually and increasingly diverse data is introduced, improving the model's generalization and multimodal capability (a stage-schedule sketch also follows this list).
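
Native dynamic resolution, as in the original NaViT work, is usually implemented by patchifying each image at its own size and packing the resulting variable-length patch sequences, rather than resizing everything to a fixed square. The sketch below shows that packing idea with an assumed patch size; it is a conceptual illustration, not the dots.vlm1 encoder itself.

```python
# Conceptual sketch of NaViT-style dynamic-resolution patchification and packing.
# The patch size and example shapes are assumptions for illustration only.
import torch

PATCH = 14  # assumed patch side length in pixels

def patchify(image: torch.Tensor) -> torch.Tensor:
    """Split a (C, H, W) image into a (num_patches, C*PATCH*PATCH) sequence.
    H and W may differ between images; only divisibility by PATCH is required."""
    c, h, w = image.shape
    patches = image.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)  # (C, H/P, W/P, P, P)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)
    return patches

# Two images with different resolutions produce different sequence lengths...
seq_a = patchify(torch.randn(3, 224, 336))  # 16 * 24 = 384 patches
seq_b = patchify(torch.randn(3, 448, 448))  # 32 * 32 = 1024 patches

# ...and are packed into one batch with attention masking instead of being resized.
packed = torch.nn.utils.rnn.pad_sequence([seq_a, seq_b], batch_first=True)
mask = torch.zeros(packed.shape[:2], dtype=torch.bool)
mask[0, :seq_a.shape[0]] = True
mask[1, :seq_b.shape[0]] = True
print(packed.shape, mask.sum(dim=1))  # torch.Size([2, 1024, 588]) tensor([ 384, 1024])
```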
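
The three training stages can be read as a simple curriculum. The sketch below writes that curriculum down as a configuration list: the stage names and the idea of gradually raising image resolution come from the description above, while the specific resolutions, data mixes, and which modules are trained in each stage are purely illustrative placeholders.

```python
# Illustrative training curriculum for a NaViT + LLM vision-language model.
# Stage names follow the description above; resolutions, data mixes, and trainable
# modules per stage are made-up placeholders, not the actual dots.vlm1 recipe.
TRAINING_STAGES = [
    {
        "name": "vision_encoder_pretraining",
        "trainable": ["vision_encoder"],
        "max_image_side": 256,   # placeholder: start at lower resolution
        "data": ["image-text pairs", "pure visual supervision"],
    },
    {
        "name": "vlm_pretraining",
        "trainable": ["vision_encoder", "mlp_adapter", "llm"],
        "max_image_side": 1024,  # placeholder: raise resolution gradually
        "data": ["interleaved web/PDF image-text", "charts", "tables",
                 "documents", "synthetic captions"],
    },
    {
        "name": "vlm_post_training",
        "trainable": ["mlp_adapter", "llm"],
        "max_image_side": 1024,  # placeholder
        "data": ["instruction-following and reasoning data"],
    },
]

for stage in TRAINING_STAGES:
    print(f"{stage['name']}: train {stage['trainable']} at up to {stage['max_image_side']}px")
```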


Project Links for dots.vlm1


Application Scenarios of dots.vlm1

  • Complex Chart Reasoning: Analyzes and reasons about complex charts to help users better understand and interpret the information within.

  • STEM Problem Solving: Assists in solving problems in science, technology, engineering, and mathematics (STEM), and can outline solution approaches.

  • Long-Tail Recognition: Possesses strong recognition capabilities for low-frequency categories or objects.

  • Visual Reasoning: Handles reasoning tasks involving visual information, such as obstacle detection and product comparison analysis.

  • Image-Text Q&A and Interaction: Supports image-text combined question answering with multi-turn dialogue, providing coherent responses based on context.

  • Content Recommendation: Leverages multimodal data to offer personalized content recommendations, such as related images, text, or videos on the Xiaohongshu platform.
