Skywork-Reward-V2 – The second-generation reward model series open-sourced by Kunlun Tech
What is Skywork-Reward-V2?
Skywork-Reward-V2 is the second-generation reward model series open-sourced by Kunlun Tech (Kunlun Wanwei), comprising eight models built on different base architectures, with parameter scales ranging from 600 million to 8 billion. The series achieves top performance across seven major reward model evaluation benchmarks. Its success is largely attributed to the Skywork-SynPref-40M dataset, a hybrid collection of 40 million preference pairs meticulously filtered through a two-stage human-machine collaboration process. Skywork-Reward-V2 excels at general preference alignment, objective correctness assessment, and safety judgment, and generalizes well to Best-of-N selection and resistance to style bias.
Key Features of Skywork-Reward-V2
- General Preference Alignment: Accurately determines which response better aligns with human preferences, ensuring outputs are more natural and context-appropriate (e.g., selecting more polite or coherent replies in conversational AI).
- Objective Correctness Assessment: Effectively judges factual accuracy, which is crucial for tasks such as math calculations or fact-based queries.
- Safety Judgment: Detects harmful or inappropriate content (e.g., violence, discrimination) to ensure ethical and safe AI outputs.
- Best-of-N Selection: Efficiently ranks multiple candidate responses and picks the best one, improving decision-making in multi-choice scenarios (a minimal usage sketch follows this list).
- Style Bias Resistance: Maintains fairness across varied writing styles (e.g., literary vs. technical text), avoiding bias due to stylistic differences.
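To make Best-of-N selection concrete, here is a minimal sketch that scores several candidate responses with a Skywork-Reward-V2 checkpoint loaded as a sequence-classification reward model via Hugging Face transformers and keeps the highest-scoring one. The checkpoint name Skywork/Skywork-Reward-V2-Llama-3.1-8B, the prompt, and the candidate responses are assumptions for illustration; substitute any model from the collection linked below.

```python
# Best-of-N selection sketch: score each candidate with the reward model and keep the argmax.
# Assumptions: the checkpoint name and the usual sequence-classification reward-model
# loading pattern from Hugging Face transformers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"  # assumed checkpoint; see the collection link below
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16
).to(device).eval()

prompt = "Explain in one sentence why the sky is blue."
candidates = [
    "Blue light is scattered more strongly by air molecules (Rayleigh scattering), so the sky looks blue.",
    "The sky reflects the color of the ocean.",
    "It just is.",
]

def score(prompt: str, response: str) -> float:
    """Return the scalar reward for a single prompt/response pair."""
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt").to(device)
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()

# Best-of-N: the candidate with the highest reward wins.
best = max(candidates, key=lambda c: score(prompt, c))
print("Selected response:", best)
```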
Technical Principles
- High-Quality Dataset (Skywork-SynPref-40M):
  - Contains 40 million preference pairs, refined via a two-stage human-AI collaboration process.
  - 26 million high-quality samples were selected, ensuring diversity and accuracy.
- Training Based on the Bradley-Terry Model:
  - Uses pairwise comparisons to compute relative preference scores, optimizing reward signals to better capture human preferences (a minimal loss sketch follows this list).
- Iterative Training & Optimization:
  - Identifies weak points in each training round, retrieves similar samples, and uses multi-model consistency for automated labeling, enhancing model performance.
- Model Architecture & Parameter Tuning:
  - Trained on Qwen3 and Llama 3 base models, with varying parameter sizes for different use cases.
  - Optimized via learning-rate and batch-size adjustments for stable convergence.
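As background for the Bradley-Terry item above: under the Bradley-Terry model, the probability that the chosen response beats the rejected one is modeled as sigmoid(r_chosen − r_rejected), and training minimizes the negative log-likelihood of the human preference labels. The PyTorch snippet below is a generic sketch of that pairwise loss, not Skywork's actual training code.

```python
# Minimal Bradley-Terry pairwise loss sketch (generic illustration, not Skywork's training code).
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """
    Bradley-Terry model: P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    Minimizing the negative log-likelihood pushes the reward of the preferred
    response above that of the rejected one.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: rewards a model assigned to a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.5, 0.8, 1.1])
loss = bradley_terry_loss(chosen, rejected)
print(f"Bradley-Terry loss: {loss.item():.4f}")
```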
Project Links
- Hugging Face: https://huggingface.co/collections/Skywork/skywork-reward-v2-685cc86ce5d9c9e4be500c84
- arXiv Paper: https://arxiv.org/pdf/2507.01352
Applications
- Dialogue System Optimization: Enhances chatbot responses for better user interaction.
- Content Recommendation: Improves personalized suggestions in recommendation engines.
- Educational Assistance: Evaluates student answers for accuracy and provides feedback.
- Content Moderation: Filters harmful or non-compliant content on social platforms.
- Game Development: Optimizes in-game narratives and dialogues for immersive experiences.