DreamO – A customized image generation framework jointly launched by ByteDance and Peking University

What is DreamO?

DreamO is a unified framework for customized image generation, jointly developed by ByteDance’s creative team and the School of Electronic and Computer Engineering at Peking University Shenzhen Graduate School.
It is built on a pretrained Diffusion Transformer (DiT) model and enables flexible customization across various image generation tasks.
DreamO integrates multiple conditions (identity, subject, style, and background) and uses feature routing constraints and a placeholder strategy to improve consistency and disentanglement between conditions.
A staged training strategy helps the model converge efficiently and keep output quality high even on complex tasks.
The framework applies to scenarios such as virtual try-on, style transfer, and subject-driven generation.

Key Features of DreamO

Multi-Condition Integration:
Supports combining multiple conditions, including identity, subject, style, and background, in a single generation.
High-Quality Generation:
Uses a staged training strategy to maintain output quality and to correct biases introduced by low-quality training data.
Flexible Condition Control:
Enables precise control over the positioning and layout of conditions within the generated images.
Broad Applicability:
Handles complex multi-condition scenarios and suits tasks such as virtual try-on, style transfer, and subject-driven generation.

Technical Principles of DreamO

Diffusion Transformer (DiT) Framework:
Uses the Diffusion Transformer as its core architecture, unifying the processing of different types of inputs (such as text, images, and conditions) for customized image generation.
The diffusion process generates an image by progressively denoising random noise, while the transformer architecture improves how the model interprets and handles the input conditions.
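As a rough illustration of what "unifying the processing of different types of inputs" can look like, the sketch below lays text tokens, noisy image tokens, and condition image tokens out as one sequence that a transformer could attend over. The Token class, the build_unified_sequence function, and the token counts are hypothetical names and values chosen for this example; they are not taken from the DreamO code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    kind: str      # "text", "noisy_image", or "condition"
    source: int    # index of the condition image, or -1 for text / noisy image
    position: int  # position of the token within its own segment


def build_unified_sequence(n_text: int, n_image: int, cond_lengths: List[int]) -> List[Token]:
    """Concatenate all inputs along the sequence axis (hypothetical layout)."""
    seq = [Token("text", -1, i) for i in range(n_text)]
    seq += [Token("noisy_image", -1, i) for i in range(n_image)]
    # Each condition image contributes its own segment, tagged with its index
    # so that later stages can tell the condition images apart.
    for idx, length in enumerate(cond_lengths):
        seq += [Token("condition", idx, i) for i in range(length)]
    return seq


if __name__ == "__main__":
    sequence = build_unified_sequence(n_text=4, n_image=16, cond_lengths=[16, 16])
    print(len(sequence), sequence[20])
```

The point of the layout is that every input type becomes a segment of the same sequence, so a single attention stack can relate any condition token to any image token.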
Feature Routing Constraints:
To improve consistency between generated results and reference images, feature routing constraints are introduced.
These constraints act on the attention between the condition images and the generated image, so that specific regions of the output correspond to the intended reference condition and entanglement between conditions is reduced.
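The text above does not give the exact form of the constraint, so the following is only one plausible sketch: compute how strongly each generated-image token responds to a condition image (averaged scaled dot-product similarity), then penalize the difference from a 0/1 mask marking where that condition should appear. The function names and the choice of mean squared error are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def response_map(cond_queries: List[Vector], image_keys: List[Vector]) -> List[float]:
    """Average scaled dot-product similarity of all condition tokens to each image token."""
    scale = 1.0 / math.sqrt(len(image_keys[0]))
    response = []
    for key in image_keys:
        scores = [scale * sum(q_i * k_i for q_i, k_i in zip(query, key))
                  for query in cond_queries]
        response.append(sum(scores) / len(scores))
    return response


def routing_loss(response: List[float], target_mask: List[float]) -> float:
    """Mean squared error between the response map and the target region mask (assumed form)."""
    return sum((r - m) ** 2 for r, m in zip(response, target_mask)) / len(response)


if __name__ == "__main__":
    queries = [[1.0, 0.0, 0.0, 0.0]]                     # one condition token
    keys = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]  # two image tokens
    resp = response_map(queries, keys)
    print(resp, routing_loss(resp, [1.0, 0.0]))
```

A loss of this kind is small when the condition's attention lands on the masked region and grows when it leaks elsewhere, which is the behavior the constraint is described as encouraging.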
Placeholder Strategy:
By adding placeholders (e.g., [ref#1]) in textual descriptions, DreamO associates condition images with specific objects mentioned in the text, achieving precise control over where conditions appear in the generated image.
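The bookkeeping side of the placeholder idea can be sketched as follows: find each [ref#N] marker in the prompt and pair it with the N-th condition image. This only illustrates the text-to-image association; how DreamO injects that association into the model is not described here. The function name bind_placeholders and the file names are hypothetical.

```python
import re
from typing import Dict, List

# Matches placeholders of the form [ref#1], [ref#2], ...
PLACEHOLDER = re.compile(r"\[ref#(\d+)\]")


def bind_placeholders(prompt: str, condition_images: List[str]) -> Dict[str, str]:
    """Map every placeholder found in the prompt to its condition image."""
    bindings: Dict[str, str] = {}
    for match in PLACEHOLDER.finditer(prompt):
        index = int(match.group(1))  # placeholders are 1-based
        if not 1 <= index <= len(condition_images):
            raise ValueError(f"{match.group(0)} has no matching condition image")
        bindings[match.group(0)] = condition_images[index - 1]
    return bindings


if __name__ == "__main__":
    prompt = "A woman [ref#1] wearing a jacket [ref#2] on a beach"
    print(bind_placeholders(prompt, ["face.png", "jacket.png"]))
```

Validating the index up front catches a prompt that refers to a condition image the user never supplied.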
Staged Training Strategy:
A multi-stage training process is employed, including an initial simple-task phase, a comprehensive multi-task training phase, and a quality alignment phase to correct biases.
This strategy helps the model efficiently converge under complex data distributions while maintaining high-quality outputs.
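A three-phase schedule like the one described can be expressed as a simple lookup from training step to active stage. The stage names mirror the description above, but the step counts are placeholders invented for this example and are not values reported for DreamO.

```python
from typing import List, Tuple

# (stage name, number of steps); the step counts are placeholders, not DreamO's values
STAGES: List[Tuple[str, int]] = [
    ("simple_task_warmup", 100),
    ("multi_task_training", 1000),
    ("quality_alignment", 50),
]


def stage_for_step(step: int, stages: List[Tuple[str, int]] = STAGES) -> str:
    """Return the name of the stage that a given global training step falls into."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for name, n_steps in stages:
        boundary += n_steps
        if step < boundary:
            return name
    raise ValueError(f"step {step} is past the end of the schedule")


if __name__ == "__main__":
    for step in (0, 100, 1100):
        print(step, stage_for_step(step))
```

In a real training loop, the returned stage name would select which datasets and objectives are active at that step.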
Large-Scale Training Dataset:
To support broad generalization, a large-scale dataset is constructed that covers identity customization, subject-driven generation, virtual try-on, and style transfer. Training on it lets the model learn to generate under many different conditions.

Project Links for DreamO

Official Website:
https://mc-e.github.io/project/DreamO/
GitHub Repository:
https://github.com/bytedance/DreamO
Technical Paper (arXiv):
https://arxiv.org/pdf/2504.16915

Application Scenarios for DreamO

Virtual Try-On:
Users can upload their photos along with clothing images to generate realistic try-on results.
Style Transfer:
Transform ordinary photos into artistic styles or generate various visual effects based on design sketches, ideal for artistic creation and design inspiration.
Subject-Driven Generation:
Create personalized avatars or virtual characters based on user-uploaded photos, supporting multi-subject fusion for use in social media, gaming, and animation production.
Identity Customization:
Generate images of a specific individual while preserving or blending identity features, suitable for virtual social platforms and personalized content creation.
Creative Content Generation:
Produce creative advertisements, movie effects, or educational scene images based on textual descriptions and condition images, supporting complex customization tasks to meet creative needs.