SWEET-RL – A Multi-Turn Reinforcement Learning Framework Released by Meta

What is SWEET-RL?

SWEET-RL is a multi-turn reinforcement learning framework introduced by Meta for training large language model (LLM) agents on collaborative reasoning tasks. During training, the framework gives the "critic" model access to additional information that is only available at training time (such as reference solutions); the critic then produces a step-wise reward signal that helps the "actor" model assign credit across turns and optimize its policy. On the ColBench benchmark, which covers tasks such as backend programming and front-end design, SWEET-RL improves success rate and win rate by 6% over other state-of-the-art multi-turn RL algorithms, allowing Llama-3.1-8B to match or even surpass top-tier models such as GPT-4o.

The main features of SWEET-RL

  • Optimizing multi-turn interaction tasks: SWEET-RL is designed for complex tasks that require multiple turns of interaction, such as backend programming and front-end design.
  • Efficient credit assignment: By giving the critic access to additional training-time information (such as reference solutions), SWEET-RL obtains a reward for every step, accurately evaluating the value of each action and addressing the credit-assignment challenge in multi-turn tasks (see the sketch after this list).
  • Support for diverse task types: SWEET-RL handles both programming and complex front-end design tasks, demonstrating versatility and adaptability across different types of collaborative tasks.
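
To make the credit-assignment point concrete, the hypothetical Python sketch below (illustrative only, not code from the SWEET-RL release) contrasts giving every turn the same end-of-episode reward with scoring each turn using a step-wise critic:

```python
# Hypothetical illustration of step-wise credit assignment; not SWEET-RL's actual code.
# A multi-turn episode is modeled as a list of (state, action) pairs from the actor.
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (interaction history so far, actor's response)

def trajectory_level_credit(turns: List[Turn], final_reward: float) -> List[float]:
    """Naive baseline: every turn receives the same end-of-episode reward,
    so good and bad intermediate actions are indistinguishable."""
    return [final_reward for _ in turns]

def step_wise_credit(turns: List[Turn],
                     critic_score: Callable[[str, str], float]) -> List[float]:
    """SWEET-RL-style idea: a critic trained with extra training-time
    information (e.g. reference solutions) scores each action separately,
    giving the actor a per-step learning signal."""
    return [critic_score(state, action) for state, action in turns]

if __name__ == "__main__":
    episode = [("user asks for a schema", "partial draft"),
               ("user asks for an endpoint", "endpoint plus tests")]
    toy_critic = lambda state, action: len(action) / 20.0  # stand-in critic
    print(trajectory_level_credit(episode, final_reward=1.0))  # [1.0, 1.0]
    print(step_wise_credit(episode, toy_critic))               # per-turn scores
```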

The Technical Principles of SWEET-RL

  • Training-time information: SWEET-RL optimizes the critic model using additional information that is available only during training (such as reference solutions). The critic provides a reward for each step, helping the actor model assign credit across turns.
  • Bradley-Terry objective: SWEET-RL trains the advantage function directly with a Bradley-Terry objective, learning how good each action is in the current state from pairwise trajectory comparisons. This avoids first training a value function to predict the expected utility of a state-action pair and aligns better with how LLMs are pre-trained (see the sketch after this list).
  • Asymmetric information structure: SWEET-RL uses an asymmetric actor-critic setup in which the critic sees additional training-time information while the actor sees only the interaction history. The critic can therefore evaluate actions more accurately, and the actor optimizes its policy based on these evaluations.
  • Parameterized advantage function: The advantage function is parameterized as the average log probability of the action's tokens and trained with a trajectory-level Bradley-Terry objective. This parameterization is closer to the pre-training objective of LLMs and improves generalization.
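
Putting the last three points together, a trajectory-level Bradley-Terry objective might take the following form. This is a sketch reconstructed from the description above, not the paper's exact notation: A_φ denotes the critic's per-step advantage and π_φ the critic LLM's token distribution.

```latex
% Per-step advantage parameterized as the average token log-probability of
% action a given context s under the critic LLM \pi_\phi
A_\phi(s, a) = \frac{1}{|a|} \sum_{k=1}^{|a|} \log \pi_\phi\left(a_k \mid s, a_{<k}\right)

% Trajectory-level Bradley-Terry loss over a preferred trajectory \tau^{+}
% and a less preferred trajectory \tau^{-}
\mathcal{L}_{\mathrm{BT}}(\phi) = -\,\mathbb{E}_{(\tau^{+},\, \tau^{-})}\!\left[
  \log \sigma\!\left(
    \sum_{t \in \tau^{+}} A_\phi(s_t, a_t)
    - \sum_{t \in \tau^{-}} A_\phi(s_t, a_t)
  \right)
\right]
```

A minimal PyTorch-style version of this loss, assuming the per-step average log-probabilities have already been computed, could look like:

```python
import torch
import torch.nn.functional as F

def bt_advantage_loss(chosen_logps: torch.Tensor,
                      rejected_logps: torch.Tensor) -> torch.Tensor:
    """Hypothetical trajectory-level Bradley-Terry loss (illustration only).
    Each tensor has shape (batch, num_turns); every entry is the average token
    log-probability of one action under the critic LLM, playing the role of
    the per-step advantage A_phi(s_t, a_t) in the formula above."""
    margin = chosen_logps.sum(dim=-1) - rejected_logps.sum(dim=-1)
    return -F.logsigmoid(margin).mean()
```

Training the advantage directly from pairwise trajectory comparisons avoids regressing absolute expected returns, which is noisy over long multi-turn episodes, and keeps the critic's interface close to the token-level log-likelihoods that pre-trained LLMs already produce.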

The project address of SWEET-RL

Application scenarios of SWEET-RL

  • Backend programming agents: Training LLM agents that collaborate with a human partner over multiple turns to write and refine backend code.
  • Front-end and web design: Training agents that iteratively produce and revise front-end designs based on a collaborator's feedback.
  • Multi-turn human-AI collaboration: Collaborative reasoning tasks in which an agent must plan across several turns and learn from step-level feedback.
  • Open-model agent training: Bringing open models such as Llama-3.1-8B up to, or beyond, the level of proprietary models like GPT-4o on collaborative tasks.