Skywork-VL Reward – An open-source multimodal reward model by Skywork AI
What is Skywork-VL Reward?
Skywork-VL Reward is an open-source multimodal reward model developed by Skywork AI. It provides reliable reward signals for multimodal understanding and reasoning tasks. Built on the Qwen2.5-VL-7B-Instruct architecture, the model adds a reward head and is trained on pairwise preference data to output scalar reward scores aligned with human preferences. It achieves a state-of-the-art score of 73.1 on VL-RewardBench and also scores a strong 90.1 on RewardBench. Paired with Mixed Preference Optimization (MPO), Skywork-VL Reward significantly enhances multimodal reasoning capabilities and marks a new breakthrough in the field of multimodal reinforcement learning.
Key Functions of Skywork-VL Reward
- Evaluate Multimodal Outputs: Assesses the quality of outputs generated by vision-language models (VLMs) to determine how well they align with human preferences.
- Provide Reward Signals: Outputs scalar reward scores that reflect the quality of generated content and its alignment with human preferences (see the comparison sketch after this list).
- Support Multimodal Tasks: Applies to a wide range of multimodal tasks, from image captioning to complex reasoning.
- Enhance Model Performance: Supports Mixed Preference Optimization (MPO) with high-quality preference data to significantly improve multimodal reasoning ability.
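To make the scoring interface concrete, here is a minimal sketch of how scalar reward scores can be used to compare candidate responses. The `ScoredResponse` class, the caption strings, and the reward values are illustrative assumptions, not output from the actual model.

```python
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    text: str
    reward: float  # scalar score produced by the reward model

def pick_preferred(candidates: list[ScoredResponse]) -> ScoredResponse:
    """Return the candidate whose scalar reward is highest."""
    return max(candidates, key=lambda c: c.reward)

# Example: two captions for the same image, already scored by the reward model
# (the numbers here are made up for illustration).
captions = [
    ScoredResponse("A dog runs across a grassy field.", reward=1.8),
    ScoredResponse("A cat sleeps on a sofa.", reward=-0.4),
]
print(pick_preferred(captions).text)
```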
Technical Principles of Skywork-VL Reward
- Model Architecture: Built on the Qwen2.5-VL-7B-Instruct architecture, which combines a vision encoder (a Vision Transformer), a vision-language adapter, and a language-model decoder. A reward head is added on top of the base model: fully connected layers process the final hidden states and output a scalar reward value.
- Dataset Construction: Integrates several open-source preference datasets (e.g., LLaVA-Critic-113k, Skywork-Reward-Preference-80K-v0.2, RLAIF-V-Dataset) with internally annotated data for complex reasoning tasks. Data quality and consistency are ensured through deduplication, similarity filtering, and preference-based filtering, and high-quality preference data are generated with advanced VLM reasoning tools to improve generalization.
- Training Method: Trains the model with a pairwise preference loss that compares the relative quality of two candidate responses (a PyTorch sketch of the reward head and this loss follows this list). A two-stage fine-tuning process is used: the first stage trains on multimodal preference data, and the second stage adds pure-text preference data to further strengthen performance in text-only scenarios.
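The architecture and training bullets above translate into a compact PyTorch sketch: a reward head maps the decoder's final hidden state at the last token to a scalar, and a pairwise Bradley-Terry-style loss pushes the chosen response's score above the rejected one's. The backbone is replaced by random tensors here, the head uses a single linear layer, and the last-token pooling is an assumption; the actual model builds on the Qwen2.5-VL-7B-Instruct decoder and may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Maps the decoder's final hidden states to a scalar reward.

    The real head may stack several fully connected layers; a single
    linear projection is shown here for brevity.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # Pool the hidden state of the last non-padded token of each sequence.
        last_token_idx = attention_mask.sum(dim=1) - 1                     # (batch,)
        batch_idx = torch.arange(last_hidden_state.size(0),
                                 device=last_hidden_state.device)
        pooled = last_hidden_state[batch_idx, last_token_idx]              # (batch, hidden)
        return self.proj(pooled).squeeze(-1)                               # (batch,)

def pairwise_preference_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: random hidden states stand in for the VLM backbone output.
hidden_size, batch, seq_len = 16, 4, 8
head = RewardHead(hidden_size)
mask = torch.ones(batch, seq_len, dtype=torch.long)
r_chosen = head(torch.randn(batch, seq_len, hidden_size), mask)
r_rejected = head(torch.randn(batch, seq_len, hidden_size), mask)
loss = pairwise_preference_loss(r_chosen, r_rejected)
loss.backward()
```

Minimizing this loss widens the margin between chosen and rejected scores, which is exactly the ranking behavior a reward model needs.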
Project Links for Skywork-VL Reward
- HuggingFace Model Hub: https://huggingface.co/Skywork/Skywork-VL-Reward
- arXiv Technical Paper: https://arxiv.org/pdf/2505.07263
Application Scenarios of Skywork-VL Reward
- Content Generation Evaluation: Evaluates the quality of generated multimodal content such as image captions and video subtitles, judging whether outputs are accurate and aligned with human preferences.
- Reasoning Task Optimization: In complex multimodal reasoning tasks (e.g., visual question answering, geometric reasoning), assesses the validity of both the reasoning process and the final outcome, helping to optimize reasoning models.
- Model Alignment: Helps keep multimodal model outputs aligned with human values and ethical standards, reducing harmful or misleading content.
- Mixed Preference Optimization (MPO): Acts as a critical component of MPO training by supplying high-quality preference data that improves the reasoning and generalization capabilities of multimodal models (a sketch of this selection loop follows this list).
- Benchmark Testing: Serves as an evaluation tool for multimodal tasks, enabling the assessment and comparison of different models and promoting the advancement of multimodal technology.
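As a concrete illustration of the MPO scenario above, the following sketch builds one preference pair by sampling several candidate responses, scoring each with the reward model, and keeping the highest- and lowest-scoring responses as the chosen/rejected pair. Both `generate_candidates` and `reward_score` are hypothetical placeholders for a policy model and for Skywork-VL Reward; only the selection logic is meant to be taken literally.

```python
from typing import Callable

def build_preference_pair(
    image_path: str,
    prompt: str,
    generate_candidates: Callable[[str, str, int], list[str]],  # policy model (hypothetical)
    reward_score: Callable[[str, str, str], float],             # reward model (hypothetical)
    num_samples: int = 8,
) -> dict:
    """Score sampled responses and keep the best/worst as a preference pair."""
    candidates = generate_candidates(image_path, prompt, num_samples)
    scored = sorted(candidates, key=lambda r: reward_score(image_path, prompt, r))
    return {
        "image": image_path,
        "prompt": prompt,
        "rejected": scored[0],   # lowest reward
        "chosen": scored[-1],    # highest reward
    }
```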