What is Ovis-U1?
Ovis-U1 is a 3-billion-parameter unified multimodal model developed by Alibaba Group's Ovis team. It integrates three core capabilities: multimodal understanding, text-to-image generation, and image editing. Through its architecture and a collaborative unified training recipe, the model achieves high-fidelity image synthesis and efficient text-visual interaction, and it attains leading results across multiple academic benchmarks for multimodal understanding, generation, and editing, demonstrating strong generalization.
Main Features of Ovis-U1
- Multimodal Understanding: Interprets complex visual scenes and textual content, answering questions about images (visual question answering, VQA) and generating image descriptions.
- Text-to-Image Generation: Generates high-quality images from textual descriptions, supporting a wide range of styles and complex scene compositions.
- Image Editing: Precisely edits images according to textual instructions, including adding, adjusting, replacing, or removing elements, as well as style transfer. A minimal loading sketch follows this list.
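For a concrete sense of how these three capabilities are typically invoked, below is a minimal loading sketch using the published HuggingFace checkpoint. The `trust_remote_code` path via `AutoModelForCausalLM` is the usual route for models that ship custom architecture code on the Hub, but this is an assumption about Ovis-U1's interface, and the commented task calls use hypothetical method names; consult the GitHub repository for the actual inference entry points.

```python
import torch
from transformers import AutoModelForCausalLM

# Loading sketch (assumption): Ovis-U1 ships custom modeling code on the Hub,
# so trust_remote_code=True is required. Exact kwargs may differ; see the repo.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis-U1-3B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()

# The task-specific calls below are hypothetical placeholders, not a confirmed
# API; the repository's own scripts define the real entry points.
#
# 1) Multimodal understanding / VQA:
#    answer = model.chat(image=pil_image, prompt="What is shown in this photo?")
# 2) Text-to-image generation:
#    image = model.generate_image(prompt="a watercolor lighthouse at dusk")
# 3) Instruction-based image editing:
#    edited = model.edit_image(image=pil_image, prompt="make the sky a starry night")
```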
Technical Principles of Ovis-U1
- Architecture Design:
  - Visual Decoder: A diffusion-Transformer architecture (MMDiT) that generates high-quality images from text embeddings.
  - Bidirectional Token Refiner: Strengthens the interaction between text and visual embeddings, improving text-to-image synthesis and image editing.
  - Visual Encoder: A pretrained visual encoder (e.g., Aimv2-large-patch14-448) fine-tuned for multimodal tasks.
  - Adapter: Connects the visual encoder to the multimodal large language model (MLLM), aligning visual and textual embeddings.
  - Multimodal Large Language Model (MLLM): The core component, which processes both textual and visual information and supports the full range of multimodal tasks.
- Unified Training Method: Ovis-U1 is trained jointly on multimodal understanding, text-to-image generation, and image editing, so that knowledge shared across tasks improves generalization. Training proceeds through six progressive stages, each targeting specific tasks to gradually build up the model's multimodal capabilities.
- Data Composition:
  - Multimodal Understanding Data: Public datasets such as COYO, Wukong, Laion, ShareGPT4V, and CC3M, plus internally developed datasets.
  - Text-to-Image Generation Data: Laion5B and JourneyDB, with detailed image captions produced by pretrained captioning models.
  - Image+Text-to-Image Data: Covers image editing, reference-image-driven generation, and pixel-level controlled generation.
- Performance Optimization: For image editing, separate text and image classifier-free guidance (CFG) coefficients are tuned so that edits follow the instruction while preserving the source image (see the sketch after this list). The model's multimodal abilities are evaluated comprehensively on benchmarks such as OpenCompass, GenEval, DPG-Bench, ImgEdit-Bench, and GEdit-Bench-EN.
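To make the guidance-coefficient adjustment concrete, here is a minimal sketch of dual classifier-free guidance in the style popularized by InstructPix2Pix, with one scale for the source image and one for the text instruction. Whether Ovis-U1 uses exactly this decomposition is an assumption; the function name and default scales below are illustrative.

```python
import torch

def dual_cfg(eps_uncond: torch.Tensor,
             eps_img: torch.Tensor,
             eps_full: torch.Tensor,
             s_img: float = 1.5,
             s_txt: float = 7.5) -> torch.Tensor:
    """Combine diffusion noise predictions with separate image/text guidance.

    eps_uncond: prediction with neither the source image nor the instruction
    eps_img:    prediction conditioned on the source image only
    eps_full:   prediction conditioned on both the image and the instruction
    """
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)   # pull toward the source image
            + s_txt * (eps_full - eps_img))    # pull toward the instruction
```

Raising `s_txt` pushes the sampler to follow the editing instruction more aggressively, while raising `s_img` keeps the result closer to the original image; trading the two off against each other is what adjusting the guidance coefficients amounts to in practice.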
Project Links
- GitHub Repository: https://github.com/AIDC-AI/Ovis-U1
- HuggingFace Model Hub: https://huggingface.co/AIDC-AI/Ovis-U1-3B
- Technical Paper: https://github.com/AIDC-AI/Ovis-U1/blob/main/docs/Ovis_U1_Report.pdf
- Online Demo: https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B
Application Scenarios of Ovis-U1
- Content Creation: Generates high-quality images and video frame sequences from text prompts, giving artists and video editors creative assistance and significantly improving production efficiency.
- Advertising and Marketing: Produces eye-catching advertising images and promotional posters from product features and target-audience descriptions, supporting social media marketing content and strengthening brand reach and user engagement.
- Game Development: Creates images of game scenes, characters, and props from background and character descriptions, offering creative inspiration and preliminary assets for game design.
- Architectural Design: Generates architectural concept images, interior scenes, and furniture layouts from style and environment descriptions, helping clients grasp designs quickly and supporting efficient presentation and communication by designers.
- Scientific Research: Produces visualizations of complex scientific phenomena, data, experimental setups, and equipment, helping researchers understand and present their findings.