What is Emu 3.5?
Emu 3.5 is a multimodal world model released by the Beijing Academy of Artificial Intelligence (BAAI). It is trained end-to-end on over 10 trillion multimodal tokens, primarily derived from internet videos totaling approximately 790 years of content, enabling it to learn and internalize the dynamic laws of the real physical world. Built on a 34-billion-parameter dense Transformer architecture, Emu 3.5 adopts a "next-state prediction" objective, achieving unified understanding and generation across text, image, and video modalities.

The model introduces multiple breakthroughs, including Discrete Diffusion Adaptation (DiDA), a technique that improves image generation speed by nearly 20× and resolves the decoding bottleneck of autoregressive models. Emu 3.5 demonstrates strong capabilities in visual storytelling, visual instruction, general-purpose image editing and generation, world modeling, and embodied task planning, producing text-image narratives, step-by-step tutorials, high-quality images, and continuous visual sequences for virtual environments and robotic task decomposition.

Key Features of Emu 3.5
- Multimodal Content Generation: Generates high-quality text and image content, or a combination of both, applicable to creative industries such as advertising, film, and gaming.
- Visual Storytelling: Creates immersive text-image stories with coherent narratives and consistent visual styles, offering new storytelling modes for education and entertainment.
- Visual Instruction: Produces step-by-step tutorials with visual examples that intuitively demonstrate processes such as painting or crafting, helping users understand and execute tasks more effectively.
- General Image Editing and Generation: Excels in open-world image editing and spatiotemporal manipulation, with superior text rendering accuracy and naturalness compared to existing state-of-the-art models.
- World Modeling and Exploration: Generates continuous visual sequences within virtual environments while maintaining geometric, semantic, and visual consistency, making it well suited to VR and game development.
- Embodied Operations: Decomposes complex robotic tasks into subtasks with language instructions and keyframe images, forming a foundation for training more general embodied agents and advancing robotics research.
Technical Principles of Emu 3.5
- Native Multimodal Architecture: Built on a 34B dense Transformer, Emu 3.5 employs a next-state prediction objective to unify understanding and generation across text, image, and video, breaking the traditional boundaries between modalities (a minimal sketch of this objective follows the list).
- Large-Scale Pretraining: Trained end-to-end on over 10 trillion multimodal tokens, primarily sourced from internet videos and their transcribed speech, totaling roughly 790 years of footage. This large-scale pretraining allows the model to learn the dynamics and causal relationships of the physical world.
- Discrete Diffusion Adaptation (DiDA): A solution to the image generation speed bottleneck of autoregressive models. DiDA improves generation speed by nearly 20× while maintaining output quality, bridging the gap between autoregressive and diffusion-based models (see the parallel-decoding sketch after this list).
- Supervised Fine-Tuning: Fine-tuned on a high-quality dataset of 150 billion samples covering diverse and complex tasks. This stage establishes a unified multimodal interaction interface and strengthens the model's ability to understand and follow complex user instructions.
- Large-Scale Multimodal Reinforcement Learning: Employs a multi-dimensional reward system that evaluates aesthetic quality, text-image alignment, narrative coherence, and other criteria (a toy reward-combination sketch appears below). Reinforcement learning further improves the model's multimodal reasoning and generation quality.
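At its core, the next-state prediction objective is autoregressive prediction over a unified discrete vocabulary in which text, image, and video are all tokenized into one interleaved sequence. The PyTorch sketch below shows what such a loss could look like; the function and tensor names are illustrative assumptions, not Emu 3.5's actual training code.

```python
# Minimal sketch of a next-state (next-token) prediction loss over an
# interleaved multimodal sequence. `model` maps token ids to per-position
# logits over the shared vocabulary; names and shapes are hypothetical.
import torch
import torch.nn.functional as F

def next_state_loss(model, tokens):
    """tokens: LongTensor [batch, seq_len] of interleaved text/image/video token ids."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each token from its prefix
    logits = model(inputs)                            # [batch, seq_len - 1, vocab_size]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten batch and time dimensions
        targets.reshape(-1),
    )
```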
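DiDA is reported to replace token-by-token image decoding with parallel, bidirectional prediction. As a rough intuition only (not the published algorithm), the sketch below fills all image-token positions starting from a fully masked state and keeps the most confident predictions each round, so an image needs a handful of forward passes rather than one per token; `mask_id` and the fixed unmasking schedule are assumptions for illustration.

```python
import torch

def parallel_decode(model, prompt, num_image_tokens, mask_id, steps=8):
    """Illustrative masked parallel decoding. prompt: [1, prompt_len] token ids;
    returns [1, num_image_tokens] of generated image token ids."""
    image = torch.full((1, num_image_tokens), mask_id, dtype=torch.long, device=prompt.device)
    per_step = max(1, num_image_tokens // steps)
    while bool(image.eq(mask_id).any()):
        # one parallel forward pass predicts every image position at once
        logits = model(torch.cat([prompt, image], dim=1))[:, -num_image_tokens:]
        conf, pred = logits.softmax(-1).max(-1)            # [1, num_image_tokens]
        conf = conf.masked_fill(image.ne(mask_id), -1.0)   # only fill still-masked slots
        k = min(per_step, int(image.eq(mask_id).sum()))
        idx = conf.topk(k, dim=-1).indices                 # most confident masked positions
        image.scatter_(1, idx, pred.gather(1, idx))
    return image
```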
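For the reinforcement learning stage, a multi-dimensional reward can be thought of as several independent scorers whose outputs are combined into a single scalar. The scorer names and weights below are hypothetical placeholders used only to illustrate the idea, not the model's actual reward system.

```python
def combined_reward(sample, scorers, weights):
    """scorers: dict name -> callable(sample) -> float in [0, 1]; weights: dict name -> float."""
    return sum(weights[name] * score(sample) for name, score in scorers.items())

# Toy usage with constant dummy scorers standing in for learned reward models.
scorers = {
    "aesthetics": lambda s: 0.8,
    "text_image_alignment": lambda s: 0.9,
    "narrative_coherence": lambda s: 0.7,
}
weights = {"aesthetics": 0.3, "text_image_alignment": 0.4, "narrative_coherence": 0.3}
print(combined_reward({"text": "a story", "images": []}, scorers, weights))
```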
Project Links
- Official Website: https://zh.emu.world
- Technical Report: https://zh.emu.world/Emu35_tech_report.pdf
Application Scenarios of Emu 3.5
- Content Creation: Generates high-quality text and image content for advertising, film, gaming, and other creative industries, providing rich visual and narrative assets.
- Education and Training: Creates immersive text-image stories and step-by-step tutorials that help students better understand concepts and enhance learning experiences.
- Virtual Reality and Game Development: Produces continuous, coherent visual sequences for virtual environments, supporting realistic and consistent scene generation.
- Robotics and Embodied Intelligence: Decomposes complex robotic operations into subtasks with language and visual guidance, helping robots understand and perform sophisticated tasks.
- Image Editing and Design: Excels in open-world image editing and spatiotemporal manipulation, providing designers with powerful, efficient creative tools.
- Intelligent Customer Interaction: Generates rich, text-and-image-based responses for customer service and interactive systems, enhancing user engagement and communication.