DanceGRPO – A Unified Visual Generation Reinforcement Learning Framework Jointly Developed by ByteDance Seed and HKU


What is DanceGRPO

DanceGRPO is the first unified visual generation reinforcement learning framework, jointly developed by ByteDance Seed and the University of Hong Kong. It applies reinforcement learning (RL) to visual generation, covering two generative paradigms (diffusion and rectified flow), three tasks (text-to-image, text-to-video, and image-to-video), four base models (Stable Diffusion, HunyuanVideo, FLUX, and SkyReels-I2V), and five reward models (image aesthetics, video aesthetics, text-image alignment, video motion quality, and binary rewards).

DanceGRPO addresses the limitations of existing RLHF (Reinforcement Learning from Human Feedback) methods in visual generation. It adapts seamlessly across generation paradigms, tasks, base models, and reward models, significantly improving model performance, reducing memory usage, supporting training on large prompt datasets, and extending naturally to rectified-flow and video generation models.



Key Features of DanceGRPO

  • Improved visual generation quality: Produces images and videos that better align with human aesthetics, appearing more realistic and natural.

  • Unified generative paradigms and tasks: Supports multiple tasks such as text-to-image, text-to-video, and image-to-video.

  • Support for diverse models and rewards: Compatible with various base and reward models to meet diverse application needs.

  • Higher training efficiency and stability: Reduces memory load, increases training efficiency, and enhances stability.

  • Enhanced human feedback learning: Enables the model to better learn from human feedback and generate more human-aligned content.


Technical Principles Behind DanceGRPO

  • Modeling the denoising process as a Markov Decision Process (MDP): DanceGRPO models the denoising process of diffusion models and rectified flows as an MDP, in which the prompt is part of the state and each denoising step is an action. This framing is what makes reinforcement learning applicable to the sampler.

  • SDE-based sampling equations: To meet GRPO’s exploration requirements, the sampling processes of diffusion and rectified flow models are unified using Stochastic Differential Equations (SDEs). For diffusion models, the forward SDE models noise addition, and the reverse SDE handles data generation. For rectified flows, SDE introduces stochasticity into the reverse process, enabling RL-based exploration.

  • Optimizing with the GRPO objective function: Inspired by DeepSeek-R1's GRPO strategy, multiple output samples are generated for each prompt, and the policy model is optimized by maximizing the GRPO objective. The objective combines reward signals with advantage estimates computed across the samples in a group, teaching the model to adjust its generation strategy according to feedback.

  • Noise initialization and timestep selection: Noise initialization is critical in DanceGRPO. To prevent reward hacking, samples from the same prompt share the same initialization noise. A timestep selection strategy optimizes specific steps to reduce computation without compromising performance.

  • Integration of multiple reward models and advantage function aggregation: To ensure stable training and high-quality outputs, DanceGRPO incorporates multiple reward models. As different reward models may have varying scales and distributions, advantage function aggregation is used to balance their contributions, enabling the model to optimize for multiple evaluation metrics and generate content more aligned with human preferences.
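To make the SDE-based sampling concrete, here is a minimal sketch of one stochastic step of a rectified-flow sampler. This is an illustration, not DanceGRPO's actual implementation: `velocity_model` is a hypothetical callable standing in for the trained velocity network, the Euler–Maruyama discretization is the simplest possible choice, and the noise scale `sigma` is illustrative rather than the paper's schedule. The point is that adding the Brownian term turns the otherwise deterministic rectified-flow update into a stochastic one, giving RL the exploration it needs.

```python
import numpy as np

def rf_sde_step(x_t, t, dt, velocity_model, sigma=0.1, rng=None):
    """One Euler-Maruyama step of a stochastic rectified-flow sampler.

    Deterministic rectified flow follows dx = v(x, t) dt; injecting a
    Brownian increment sigma * sqrt(dt) * eps makes the reverse process
    stochastic, so different rollouts from the same state can diverge.
    """
    rng = rng or np.random.default_rng()
    v = velocity_model(x_t, t)               # predicted velocity field
    noise = rng.standard_normal(x_t.shape)   # Brownian increment
    return x_t + v * dt + sigma * np.sqrt(dt) * noise
```

Setting `sigma=0` recovers the deterministic rectified-flow (ODE) update, which is why the stochastic term is the knob that enables exploration.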
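The group-based optimization and shared noise initialization described above can be sketched as follows. This is a hedged toy version, not the paper's code: `sample_fn` and `reward_fn` are hypothetical stand-ins for the sampler and reward model, the group baseline (mean/std over samples from the same prompt) replaces a learned critic as in GRPO, and a PPO-style clipped surrogate is assumed for the objective.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against its own prompt group's
    mean and std; GRPO uses this group baseline instead of a critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate averaged over the sample group."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * advantages, clipped * advantages).mean()

def rollout_group(prompt, sample_fn, reward_fn, group_size=4, seed=0):
    """Draw several samples for one prompt from a single shared
    initialization noise, mirroring the noise-sharing strategy that
    the framework uses to curb reward hacking."""
    rng = np.random.default_rng(seed)
    z0 = rng.standard_normal(16)          # shared initial noise
    samples = [sample_fn(prompt, z0) for _ in range(group_size)]
    return samples, group_advantages([reward_fn(s) for s in samples])
```

Because advantages are centered within each group, samples that beat their siblings get positive advantages and the rest get negative ones, regardless of the reward model's absolute scale.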
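The advantage aggregation across reward models can likewise be sketched in a few lines. Again this is an assumed minimal form: each reward stream is standardized separately before a weighted sum, so a reward model with a large raw scale (say, a 0–10 aesthetics score) cannot drown out one with a small scale (say, a 0–1 alignment score). The equal default weights are illustrative, not the paper's values.

```python
import numpy as np

def aggregate_advantages(reward_table, weights=None, eps=1e-8):
    """Combine per-sample advantages from several reward models.

    reward_table: dict mapping reward-model name -> list of rewards,
    one entry per sample in the group. Each stream is normalized on
    its own before weighting, balancing differently scaled models.
    """
    names = list(reward_table)
    weights = weights or {n: 1.0 / len(names) for n in names}
    total = np.zeros(len(next(iter(reward_table.values()))))
    for name in names:
        r = np.asarray(reward_table[name], dtype=float)
        adv = (r - r.mean()) / (r.std() + eps)   # per-model normalization
        total = total + weights[name] * adv
    return total
```

With this scheme, two reward models that agree on a sample's relative quality reinforce each other even when their raw score ranges differ by orders of magnitude.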




Application Scenarios of DanceGRPO

  • Text-to-image generation: Generate high-quality images based on textual descriptions for use in advertising, game development, and more, improving creative efficiency.

  • Text-to-video generation: Generate smooth, coherent videos from text prompts, applicable in video ads and educational video production, reducing manual effort.

  • Image-to-video generation: Convert static images into dynamic videos for animation production, virtual reality, and enhanced visual experiences.

  • Multimodal content creation: Combine text, image, and video generation to produce diverse content for multimedia education and interactive entertainment, enhancing immersion.

  • Creative design and art generation: Assist artists and designers in quickly generating ideas and artworks, inspiring creativity and boosting productivity.
