Skywork-VL Reward – An open-source multimodal reward model by Skywork AI
What is Skywork-VL Reward?
Skywork-VL Reward is an open-source multimodal reward model developed by Skywork AI. It provides reliable reward signals for multimodal understanding and reasoning tasks. Built on the Qwen2.5-VL-7B-Instruct architecture, the model adds a reward head and is trained on pairwise preference data to output scalar reward scores aligned with human preferences. It achieves a state-of-the-art score of 73.1 on VL-RewardBench and also scores a strong 90.1 on RewardBench. Paired with Mixed Preference Optimization (MPO), Skywork-VL Reward significantly enhances multimodal reasoning capabilities and marks a new breakthrough in the field of multimodal reinforcement learning.
Key Functions of Skywork-VL Reward
- Evaluate Multimodal Outputs: Assesses the quality of outputs generated by vision-language models (VLMs) to determine how well they align with human preferences.
- Provide Reward Signals: Outputs scalar reward scores that reflect the quality of generated content and its alignment with human preferences (see the comparison sketch after this list).
- Support Multimodal Tasks: Applies to a wide range of multimodal tasks, from image captioning to complex reasoning.
- Enhance Model Performance: Supports Mixed Preference Optimization (MPO) with high-quality preference data to significantly improve multimodal reasoning ability.
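To make the scoring interface concrete, here is a minimal sketch of how scalar reward scores can be used to compare candidate responses. The `ScoredResponse` class, the caption strings, and the reward values are illustrative assumptions, not output from the actual model.

```python
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    text: str
    reward: float  # scalar score produced by the reward model

def pick_preferred(candidates: list[ScoredResponse]) -> ScoredResponse:
    """Return the candidate whose scalar reward is highest."""
    return max(candidates, key=lambda c: c.reward)

# Example: two captions for the same image, already scored by the reward model
# (the numbers here are made up for illustration).
captions = [
    ScoredResponse("A dog runs across a grassy field.", reward=1.8),
    ScoredResponse("A cat sleeps on a sofa.", reward=-0.4),
]
print(pick_preferred(captions).text)
```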
Technical Principles of Skywork-VL Reward
- Model Architecture: Built on the Qwen2.5-VL-7B-Instruct architecture, which combines a vision encoder (a Vision Transformer), a vision-language adapter, and a language-model decoder. A reward head is added on top of the base model: fully connected layers process the final hidden states and output a scalar reward value.
- Dataset Construction: Integrates several open-source preference datasets (e.g., LLaVA-Critic-113k, Skywork-Reward-Preference-80K-v0.2, RLAIF-V-Dataset) with internally annotated data for complex reasoning tasks. Data quality and consistency are ensured through deduplication, similarity filtering, and preference-based filtering, and high-quality preference data are generated with advanced VLM reasoning tools to improve generalization.
- Training Method: Trains the model with a pairwise preference loss that compares the relative quality of two candidate responses (a PyTorch sketch of the reward head and this loss follows this list). A two-stage fine-tuning process is used: the first stage trains on multimodal preference data, and the second stage adds pure-text preference data to further strengthen performance in text-only scenarios.
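The architecture and training bullets above translate into a compact PyTorch sketch: a reward head maps the decoder's final hidden state at the last token to a scalar, and a pairwise Bradley-Terry-style loss pushes the chosen response's score above the rejected one's. The backbone is replaced by random tensors here, the head uses a single linear layer, and the last-token pooling is an assumption; the actual model builds on the Qwen2.5-VL-7B-Instruct decoder and may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Maps the decoder's final hidden states to a scalar reward.

    The real head may stack several fully connected layers; a single
    linear projection is shown here for brevity.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # Pool the hidden state of the last non-padded token of each sequence.
        last_token_idx = attention_mask.sum(dim=1) - 1                     # (batch,)
        batch_idx = torch.arange(last_hidden_state.size(0),
                                 device=last_hidden_state.device)
        pooled = last_hidden_state[batch_idx, last_token_idx]              # (batch, hidden)
        return self.proj(pooled).squeeze(-1)                               # (batch,)

def pairwise_preference_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: random hidden states stand in for the VLM backbone output.
hidden_size, batch, seq_len = 16, 4, 8
head = RewardHead(hidden_size)
mask = torch.ones(batch, seq_len, dtype=torch.long)
r_chosen = head(torch.randn(batch, seq_len, hidden_size), mask)
r_rejected = head(torch.randn(batch, seq_len, hidden_size), mask)
loss = pairwise_preference_loss(r_chosen, r_rejected)
loss.backward()
```

Minimizing this loss widens the margin between chosen and rejected scores, which is exactly the ranking behavior a reward model needs.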
Project Links for Skywork-VL Reward
- HuggingFace Model Hub: https://huggingface.co/Skywork/Skywork-VL-Reward
- arXiv Technical Paper: https://arxiv.org/pdf/2505.07263
Application Scenarios of Skywork-VL Reward
- Content Generation Evaluation: Evaluates the quality of generated multimodal content such as image captions and video subtitles, judging whether outputs are accurate and aligned with human preferences.
- Reasoning Task Optimization: In complex multimodal reasoning tasks (e.g., visual question answering, geometric reasoning), assesses the validity of both the reasoning process and the final outcome, helping to optimize reasoning models.
- Model Alignment: Helps keep multimodal model outputs aligned with human values and ethical standards, reducing harmful or misleading content.
- Mixed Preference Optimization (MPO): Acts as a critical component of MPO training by supplying high-quality preference data that improves the reasoning and generalization capabilities of multimodal models (a sketch of this selection loop follows this list).
- Benchmark Testing: Serves as an evaluation tool for multimodal tasks, enabling the assessment and comparison of different models and promoting the advancement of multimodal technology.
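As a concrete illustration of the MPO scenario above, the following sketch builds one preference pair by sampling several candidate responses, scoring each with the reward model, and keeping the highest- and lowest-scoring responses as the chosen/rejected pair. Both `generate_candidates` and `reward_score` are hypothetical placeholders for a policy model and for Skywork-VL Reward; only the selection logic is meant to be taken literally.

```python
from typing import Callable

def build_preference_pair(
    image_path: str,
    prompt: str,
    generate_candidates: Callable[[str, str, int], list[str]],  # policy model (hypothetical)
    reward_score: Callable[[str, str, str], float],             # reward model (hypothetical)
    num_samples: int = 8,
) -> dict:
    """Score sampled responses and keep the best/worst as a preference pair."""
    candidates = generate_candidates(image_path, prompt, num_samples)
    scored = sorted(candidates, key=lambda r: reward_score(image_path, prompt, r))
    return {
        "image": image_path,
        "prompt": prompt,
        "rejected": scored[0],   # lowest reward
        "chosen": scored[-1],    # highest reward
    }
```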