SimpleAR – An Image Generation Model Developed by Fudan University in Collaboration with ByteDance’s Seed Team
What is SimpleAR?
SimpleAR is a fully autoregressive image generation model jointly developed by the Visual Learning Lab at Fudan University and ByteDance’s Seed team. It adopts a streamlined autoregressive architecture that enables high-quality image generation through optimized training and inference processes. With only 500 million parameters, SimpleAR is capable of generating images at 1024×1024 resolution, achieving outstanding results on benchmarks such as GenEval.
SimpleAR uses a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning) to significantly enhance text-image alignment and generation quality. It is also compatible with modern inference-acceleration techniques such as vLLM, reducing the time to generate a 1024×1024 image to under 14 seconds.
Key Features of SimpleAR
- High-Quality Text-to-Image Generation: SimpleAR is a fully autoregressive visual generation framework that produces high-resolution (1024×1024) images with just 500 million parameters, achieving a score of 0.59 on the GenEval benchmark.
- Multimodal Fusion Generation: Text and visual tokens are treated equally within a unified Transformer architecture, supporting multimodal modeling and enhancing the model's ability to generate images guided by text.
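The "treated equally" idea can be sketched as a single token sequence shared by both modalities. This is an illustrative sketch only: the vocabulary sizes, the id-offset scheme, and the `BOI` marker below are assumptions for the example, not SimpleAR's actual configuration.

```python
# Sketch of a unified multimodal token sequence (toy values, not SimpleAR's).
# Text and visual tokens live in one sequence, so a single decoder-only
# Transformer can attend over both with the same causal mask.

TEXT_VOCAB_SIZE = 32_000   # assumed text-tokenizer vocabulary size
IMAGE_VOCAB_SIZE = 64_000  # assumed visual-codebook size
BOI = TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE  # hypothetical begin-of-image marker

def build_sequence(text_tokens, image_tokens):
    """Concatenate prompt tokens and image tokens into one causal sequence.

    Image-token ids are offset past the text vocabulary so both modalities
    share a single embedding table, as in a unified decoder-only model.
    """
    offset_image = [t + TEXT_VOCAB_SIZE for t in image_tokens]
    return text_tokens + [BOI] + offset_image

seq = build_sequence([17, 402, 9], [5, 5, 1023])
# -> [17, 402, 9, 96000, 32005, 32005, 33023]
```

Because every position, text or image, is predicted from the positions before it, no separate text encoder is needed, which is where the parameter efficiency noted below comes from.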
Technical Principles of SimpleAR
- Autoregressive Generation Mechanism: SimpleAR follows the classic autoregressive paradigm, generating image content step by step by predicting the next token. An image is decomposed into a sequence of discrete tokens, and the full image is reconstructed token by token.
- Multimodal Fusion: Text encoding and visual generation are unified within a decoder-only Transformer architecture. This improves parameter efficiency, supports joint modeling of the text and vision modalities, and enables more natural alignment between textual descriptions and generated images.
- Three-Stage Training Approach:
  - Pretraining: learn general visual and language patterns from large-scale datasets.
  - Supervised Fine-Tuning (SFT): further improve generation quality and instruction following on supervised data.
  - Reinforcement Learning (GRPO): post-training optimization with simple reward functions (e.g., CLIP score) to improve the aesthetics and multimodal alignment of generated content.
- Inference Acceleration: Leveraging technologies such as vLLM, SimpleAR significantly reduces inference time; the 0.5B model can generate a high-quality 1024×1024 image in under 14 seconds.
- Visual Tokenizer Selection: SimpleAR uses Cosmos as its visual tokenizer. While effective, Cosmos has limitations in handling low-resolution details and image reconstruction, leaving room for future improvement.
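The token-by-token mechanism above can be sketched in a few lines. Everything here is a stand-in: `fake_logits` replaces the real Transformer forward pass, and the vocabulary size and function names are hypothetical, not SimpleAR's API.

```python
import random

# Toy sketch of autoregressive (greedy) decoding, not SimpleAR's actual code.
# A real model would return logits from a decoder-only Transformer; a
# deterministic stand-in keeps the example self-contained.

def fake_logits(sequence, vocab_size=8):
    # Stand-in for the Transformer forward pass: pseudo-logits seeded
    # deterministically by the sequence so far.
    rng = random.Random(sum(sequence))
    return [rng.random() for _ in range(vocab_size)]

def generate(prompt_tokens, num_image_tokens, vocab_size=8):
    """Predict one visual token at a time, conditioning on all prior tokens."""
    seq = list(prompt_tokens)
    for _ in range(num_image_tokens):
        logits = fake_logits(seq, vocab_size)
        next_token = max(range(vocab_size), key=logits.__getitem__)  # argmax
        seq.append(next_token)
    return seq[len(prompt_tokens):]  # the generated image-token sequence

tokens = generate([1, 2, 3], num_image_tokens=4)
```

A 1024×1024 image corresponds to a much longer sequence, e.g. 64×64 = 4096 tokens for a tokenizer with 16× spatial downsampling (the exact factor depends on the tokenizer), which is why per-step acceleration such as vLLM's optimized serving matters so much for wall-clock latency.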
Project Repository
- GitHub Repository: https://github.com/wdrink/SimpleAR
- HuggingFace Paper Page: https://huggingface.co/papers/2504.11455
- arXiv Technical Paper: https://arxiv.org/pdf/2504.11455
Application Scenarios for SimpleAR
- Creative Design: Designers can use SimpleAR to quickly generate high-quality images for advertisements, posters, and artistic creations.
- Virtual Scene Generation: Generate virtual environments from textual descriptions for game development, VR, and AR applications.
- Multimodal Machine Translation: Leverage SimpleAR's multimodal capabilities to combine image information with text for more accurate and enriched translations.
- Video Description Generation: Combine image generation with video content to produce detailed descriptions for video segments.
- Augmented Reality (AR) and Virtual Reality (VR): Create virtual visuals that blend seamlessly into real environments for industrial maintenance, educational demonstrations, tourism guides, and more, generating high-quality virtual scenes and objects that enhance immersive user experiences.
- Image Enhancement and Restoration: Improve the quality of low-resolution images or restore missing or damaged regions by generating detailed content based on the surrounding areas.