T2I-R1 – A text-to-image model jointly launched by The Chinese University of Hong Kong and Shanghai AI Laboratory
What is T2I-R1
T2I-R1 is a novel text-to-image generation model jointly developed by The Chinese University of Hong Kong and Shanghai AI Laboratory. By introducing a dual-level reasoning mechanism—semantic-level Chain of Thought (CoT) and token-level CoT—it decouples high-level image planning from low-level pixel generation, significantly improving image quality and robustness. Built on the BiCoT-GRPO reinforcement learning framework, T2I-R1 employs a multi-expert reward ensemble to optimize the generation process. Across multiple benchmarks, T2I-R1 outperforms leading models such as FLUX.1, demonstrating strong capabilities in understanding complex scenes and generating high-quality images.
Key Features of T2I-R1
- High-Quality Image Generation: Utilizes a dual-level reasoning mechanism (semantic-level and token-level CoT) to generate images that better align with human expectations.
- Complex Scene Understanding: Capable of reasoning through complex semantics in user prompts, generating highly relevant images that perform well in rare or ambiguous scenarios.
- Enhanced Diversity: Semantic-level CoT enables better planning, increasing the diversity of generated outputs and avoiding repetitive results.
Technical Principles of T2I-R1
- Dual-Level CoT Reasoning Mechanism:
  - Semantic-Level CoT: Performs reasoning and planning based on textual prompts before image generation, defining the overall structure and layout of elements.
  - Token-Level CoT: Focuses on local details and visual coherence by generating image tokens block-by-block during the image synthesis process.
- BiCoT-GRPO Algorithm: A reinforcement learning-based approach that jointly optimizes both semantic-level and token-level CoT reasoning. It introduces Group-Relative Reward and a multi-expert reward ensemble to evaluate image quality from multiple perspectives.
- Multi-Expert Reward Ensemble: Combines several vision models—including human preference models, object detectors, and visual question answering models—to evaluate aesthetics, text-image alignment, and object presence. This ensemble strategy prevents overfitting to a single reward model and improves the stability and generalizability of generated results.
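The two-stage generation flow described above can be sketched in a few lines. This is a toy illustration, not T2I-R1's actual code: the function names (`plan_semantic_cot`, `generate_token`) and the deterministic token sampler are hypothetical stand-ins for the model's learned components.

```python
# Toy sketch of the dual-level CoT pipeline: first a semantic-level plan,
# then block-by-block token-level generation conditioned on that plan.
# All functions here are hypothetical stand-ins for learned model components.

def plan_semantic_cot(prompt: str) -> str:
    """Semantic-level CoT: reason about global structure and layout
    before any image tokens are produced."""
    # Stand-in: a real model would autoregressively generate this plan text.
    return f"Plan for '{prompt}': subject centered, background softly blurred."

def generate_token(prompt: str, plan: str, history: list) -> int:
    """Token-level CoT: emit the next image token, conditioned on the
    prompt, the semantic plan, and all previously generated tokens."""
    # Stand-in: a deterministic toy sampler instead of a learned decoder.
    return (len(history) * 31 + len(plan)) % 1024

def generate_image_tokens(prompt: str, num_tokens: int = 16) -> list:
    plan = plan_semantic_cot(prompt)     # stage 1: high-level planning
    tokens = []
    for _ in range(num_tokens):          # stage 2: block-by-block synthesis
        tokens.append(generate_token(prompt, plan, tokens))
    return tokens

tokens = generate_image_tokens("a red fox in snow")
print(len(tokens))  # 16 tokens, ready for an image decoder in the real model
```

The key design point the sketch captures is the decoupling: the planning step runs once and is fixed before token generation begins, so local token decisions cannot drift away from the global layout.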
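The reward side of BiCoT-GRPO can likewise be sketched. The numbers below are invented toy scores; in T2I-R1 the "experts" would be real models (a human-preference model, an object detector, a VQA model) scoring each generated image, and the group-relative normalization shown is the standard GRPO-style advantage, assumed here rather than taken from the paper's exact formulation.

```python
# Toy sketch: multi-expert reward ensemble plus group-relative advantages.
# Expert scores are invented; real experts would be vision reward models.

def ensemble_reward(expert_scores):
    """Average the scores from several reward experts, so no single
    reward model can be overfitted to."""
    return sum(expert_scores) / len(expert_scores)

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sample's reward against the
    mean and standard deviation of its own group, instead of training
    a separate value network."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# One group of 4 images sampled for the same prompt, each scored by 3 experts
# (e.g. preference, alignment, object presence) on a 0-1 scale.
group_scores = [[0.8, 0.7, 0.9], [0.4, 0.5, 0.3], [0.6, 0.6, 0.6], [0.9, 0.8, 1.0]]
rewards = [ensemble_reward(s) for s in group_scores]
advantages = group_relative_advantages(rewards)
print(advantages)  # above-average images get positive advantages
```

Images whose ensemble reward beats the group mean receive positive advantages and are reinforced; below-average ones are pushed down, which is what lets GRPO train without a learned critic.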
Project Links
- GitHub Repository: https://github.com/CaraJ7/T2I-R1
- arXiv Technical Paper: https://arxiv.org/pdf/2505.00703
Application Scenarios for T2I-R1
- Creative Design: Assists designers in rapidly generating concept sketches and artistic works, saving time.
- Content Production: Helps produce characters and scenes for advertising, film, and gaming, enhancing productivity.
- Educational Support: Generates visuals aligned with educational content to help students better understand abstract concepts.
- Virtual Reality: Creates virtual scenes or objects based on user input, enhancing immersion.
- Intelligent Customer Service: Generates intuitive visuals to help users better understand products or services.