Skywork-Reward-V2 – The second-generation reward model series open-sourced by Kunlun Tech
What is Skywork-Reward-V2?
Skywork-Reward-V2 is the second-generation reward model series open-sourced by Kunlun Tech (Kunlun Wanwei), comprising eight models built on different base architectures, with parameter scales ranging from 600 million to 8 billion. The series achieves top performance across seven major reward model evaluation benchmarks. Its success is largely attributed to the Skywork-SynPref-40M dataset, a hybrid collection of 40 million preference pairs meticulously filtered through a two-stage human-machine collaboration process. Skywork-Reward-V2 excels at general preference alignment, objective correctness assessment, and safety judgment, and generalizes well to Best-of-N selection and resistance to style bias.
Key Features of Skywork-Reward-V2
- General Preference Alignment: Accurately determines which response better aligns with human preferences, ensuring outputs are more natural and context-appropriate (e.g., selecting more polite or coherent replies in conversational AI).
- Objective Correctness Assessment: Effectively judges factual accuracy, which is crucial for tasks such as math calculations or fact-based queries.
- Safety Judgment: Detects harmful or inappropriate content (e.g., violence, discrimination) to ensure ethical and safe AI outputs.
- Best-of-N Selection: Efficiently ranks multiple candidate responses and picks the best one, improving decision-making in multi-choice scenarios (a minimal usage sketch follows this list).
- Style Bias Resistance: Maintains fairness across varied writing styles (e.g., literary vs. technical text), avoiding bias due to stylistic differences.
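To make Best-of-N selection concrete, here is a minimal sketch that scores several candidate responses with a Skywork-Reward-V2 checkpoint loaded as a sequence-classification reward model via Hugging Face transformers and keeps the highest-scoring one. The checkpoint name Skywork/Skywork-Reward-V2-Llama-3.1-8B, the prompt, and the candidate responses are assumptions for illustration; substitute any model from the collection linked below.

```python
# Best-of-N selection sketch: score each candidate with the reward model and keep the argmax.
# Assumptions: the checkpoint name and the usual sequence-classification reward-model
# loading pattern from Hugging Face transformers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"  # assumed checkpoint; see the collection link below
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16
).to(device).eval()

prompt = "Explain in one sentence why the sky is blue."
candidates = [
    "Blue light is scattered more strongly by air molecules (Rayleigh scattering), so the sky looks blue.",
    "The sky reflects the color of the ocean.",
    "It just is.",
]

def score(prompt: str, response: str) -> float:
    """Return the scalar reward for a single prompt/response pair."""
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt").to(device)
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()

# Best-of-N: the candidate with the highest reward wins.
best = max(candidates, key=lambda c: score(prompt, c))
print("Selected response:", best)
```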
Technical Principles
- High-Quality Dataset (Skywork-SynPref-40M):
  - Contains 40 million preference pairs, refined via a two-stage human-AI collaboration process.
  - 26 million high-quality samples were selected, ensuring diversity and accuracy.
- Training Based on the Bradley-Terry Model:
  - Uses pairwise comparisons to compute relative preference scores, optimizing reward signals to better capture human preferences (a minimal loss sketch follows this list).
- Iterative Training & Optimization:
  - Identifies weak points in each training round, retrieves similar samples, and uses multi-model consistency for automated labeling, enhancing model performance.
- Model Architecture & Parameter Tuning:
  - Trained on Qwen3 and Llama 3 base models, with varying parameter sizes for different use cases.
  - Optimized via learning-rate and batch-size adjustments for stable convergence.
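As background for the Bradley-Terry item above: under the Bradley-Terry model, the probability that the chosen response beats the rejected one is modeled as sigmoid(r_chosen − r_rejected), and training minimizes the negative log-likelihood of the human preference labels. The PyTorch snippet below is a generic sketch of that pairwise loss, not Skywork's actual training code.

```python
# Minimal Bradley-Terry pairwise loss sketch (generic illustration, not Skywork's training code).
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """
    Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    Minimizing the negative log-likelihood pushes the reward of the preferred
    response above that of the rejected one.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: rewards a model assigned to a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.8, 1.1])
loss = bradley_terry_loss(chosen, rejected)
print(f"Bradley-Terry loss: {loss.item():.4f}")
```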
Project Links
- Hugging Face: https://huggingface.co/collections/Skywork/skywork-reward-v2-685cc86ce5d9c9e4be500c84
- arXiv Paper: https://arxiv.org/pdf/2507.01352
Applications
- Dialogue System Optimization: Enhances chatbot responses for better user interaction.
- Content Recommendation: Improves personalized suggestions in recommendation engines.
- Educational Assistance: Evaluates student answers for accuracy and provides feedback.
- Content Moderation: Filters harmful or non-compliant content on social platforms.
- Game Development: Optimizes in-game narratives and dialogues for immersive experiences.