USO – AI Painting Model Released by ByteDance

What is USO?

USO (Unified Style-Subject Optimized) is an AI painting model released by ByteDance’s UXO team. It can freely combine any subject with any style in any scene, producing images with high subject consistency, strong style fidelity, and a natural, non-plastic look. USO is trained on a large-scale triplet dataset (content image, style image, stylized image) and adopts a decoupled learning strategy that aligns style features while separating content from style. It also introduces Style Reward Learning (SRL), which uses a reward signal to further improve generation quality. Alongside the model, the team released USO-Bench, a benchmark that evaluates both style similarity and subject fidelity. According to the team’s experiments, USO achieves state-of-the-art performance among open-source models on both dimensions.

Main Features of USO

Style and Subject Fusion: Freely combines any subject with any style, producing images that preserve the subject’s features while adhering to the chosen style. This addresses the long-standing difficulty of blending a style with a subject without losing either.
High-Fidelity Generation: Maintains strong subject consistency and high style fidelity during image generation, ensuring natural and high-quality outputs.
Multi-Scene Applications: Applicable to diverse domains, including artistic creation, advertising design, and game development.
Open-Source Support: Fully open-sourced with training code, inference scripts, model weights, and datasets, providing rich resources for researchers and developers.
State-of-the-Art Performance: Achieves leading results in both subject consistency and style similarity, thanks to large-scale triplet data and decoupled learning strategies.
Benchmark Testing: Ships with USO-Bench, which evaluates both style similarity and subject fidelity and offers a unified comparison standard for future models.

Technical Principles of USO

Large-Scale Triplet Dataset Construction: Builds datasets of content images, style images, and corresponding stylized images, providing a strong foundation for training.
Decoupled Learning Strategy: A two-phase training process that aligns style features while disentangling content and style, avoiding feature interference and enabling precise fusion.
Style Reward Learning (SRL): Introduces reward signals to optimize generation quality, balancing style similarity with subject consistency, further boosting model performance.
Unified Framework: Merges style-driven and subject-driven tasks into a single framework, resolving the traditional trade-off between the two and enabling co-optimization.
Two-Stage Training Pipeline: Stage one trains style alignment to enable style reproduction; stage two performs content–style decoupled training for joint conditional generation, with SRL supervising the overall process.
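
The triplet data and the two-stage schedule described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the USO training code: the `Triplet` fields, the `conditions_for_stage` helper, and the reward-weighted `total_loss` are hypothetical names, and the article does not specify how the SRL reward enters the objective. Here it is assumed, for simplicity, that stage one conditions on the style image alone and stage two on both images.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One training record: (content image, style image, stylized image).

    File paths stand in for image tensors; all field names are hypothetical.
    """
    content_path: str
    style_path: str
    stylized_path: str


def conditions_for_stage(sample: Triplet, stage: int) -> dict:
    """Pick the conditioning inputs and the target for a training stage.

    Assumption: stage 1 (style alignment) sees only the style image;
    stage 2 (content-style decoupled training) sees both images.
    """
    if stage == 1:
        return {"style": sample.style_path, "content": None,
                "target": sample.stylized_path}
    if stage == 2:
        return {"style": sample.style_path, "content": sample.content_path,
                "target": sample.stylized_path}
    raise ValueError(f"unknown stage: {stage}")


def total_loss(reconstruction_loss: float, style_reward: float,
               reward_weight: float = 0.1) -> float:
    """Toy objective: a higher style reward lowers the loss.

    The subtraction and the default weight are illustrative assumptions,
    not the objective used by USO.
    """
    return reconstruction_loss - reward_weight * style_reward


if __name__ == "__main__":
    sample = Triplet("cat.png", "ink_wash.png", "cat_ink_wash.png")
    for stage in (1, 2):
        print(stage, conditions_for_stage(sample, stage))
```

Keeping the stage logic in one function makes the difference between the two stages explicit: the target image is the same, and only the set of conditioning inputs changes.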

Core Values of USO

Innovative Collaborative Decoupling Paradigm: Breaks the separation between style and subject generation tasks, proving that cross-task joint learning enables deeper content–style disentanglement and mutual enhancement.
Powerful Unified Generation Model: Presented as the first model to achieve SOTA subject consistency and SOTA style similarity among open-source models within a single framework, with strong results and good generality.
Reward Learning Enhancement: Successfully applies reward learning to style generation, providing an effective path for fine-grained control and improved aesthetics.
First Unified Evaluation Benchmark: USO-Bench fills the evaluation gap in this domain, offering a fair and comprehensive platform for comparison.

USO Project Links

Official Website: https://bytedance.github.io/USO/
GitHub Repository: https://github.com/bytedance/USO
arXiv Paper: https://arxiv.org/pdf/2508.18966

Model Capabilities of USO

Precise Style Transfer: Accurately transfers different styles onto new content while preserving brushstrokes and color patterns of the original style without distorting the subject.
Strong Subject Preservation: Locks subject features during style transformation, maintaining integrity across multiple styles.
Unified Generation Ability: Satisfies style and subject requirements at the same time, producing images that combine stylistic fidelity with subject preservation.
High-Quality Outputs: Achieves SOTA results in subject-driven, style-driven, and joint subject–style generation tasks, producing natural, realistic, and high-quality images.
High Adaptability: Handles a wide range of subjects (people, animals, environments) and styles (oil painting, ink wash, comic art, etc.) with strong adaptability.
Quantitative Superiority: On USO-Bench, USO is reported to significantly outperform existing open-source SOTA models on key metrics (e.g., CLIP-I and DINO for subject similarity, CSD for style similarity) in both subject-driven and style-driven tasks. On the more challenging joint subject–style task, it also leads by a large margin, which reflects its unified generation ability.
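
Metrics such as CLIP-I and DINO are commonly computed as the cosine similarity between the embedding of a generated image and the embedding of a reference image, averaged over a test set. The sketch below shows only that final scoring step in plain Python. It assumes the embeddings have already been extracted by the relevant encoder; the toy vectors and the helper names `cosine_similarity` and `mean_similarity` are illustrative and are not taken from USO-Bench.

```python
import math
from typing import Iterable, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_similarity(
    pairs: Iterable[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Average cosine similarity over (generated, reference) embedding pairs."""
    scores = [cosine_similarity(gen, ref) for gen, ref in pairs]
    if not scores:
        raise ValueError("need at least one pair")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs (hypothetical values).
    pairs = [
        ([1.0, 0.0], [1.0, 0.0]),  # identical direction
        ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
    ]
    print(mean_similarity(pairs))
```

In a real evaluation the same scoring step would run once per metric, with subject-oriented encoders for the subject scores and a style-oriented encoder for the style score.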

Application Scenarios of USO

Art Creation: Artists can apply different styles to the same subject, quickly generating multiple sketches or final artworks, sparking creativity and boosting efficiency.
Advertising Design: Designers can generate targeted advertising visuals tailored to themes and audiences, improving engagement and relevance.
Game Development: Developers can stylize game characters and environments, e.g., transforming realistic characters into cartoon styles to enrich visual diversity.
Film and TV Production: Supports VFX artists in rapidly generating stylized scenes or character concepts (e.g., futuristic characters for sci-fi films).
Education: Serves as a teaching tool in art and design education, helping students understand and apply different artistic styles, for instance by showing how the same work appears across different art styles.