Skywork-VL Reward – An open-source multimodal reward model by Skywork AI


What is Skywork-VL Reward?

Skywork-VL Reward is an open-source multimodal reward model developed by Skywork AI. It provides reliable reward signals for multimodal understanding and reasoning tasks. Built on the Qwen2.5-VL-7B-Instruct architecture, the model adds a reward head and is trained on pairwise preference data to output scalar reward scores aligned with human preferences. It achieves a state-of-the-art score of 73.1 on VL-RewardBench and a strong 90.1 on RewardBench. By supplying high-quality preference data for Mixed Preference Optimization (MPO), Skywork-VL Reward significantly enhances multimodal reasoning capabilities and marks a new breakthrough in the field of multimodal reinforcement learning.


Key Functions of Skywork-VL Reward

  • Evaluate Multimodal Outputs: Assesses the quality of outputs generated by vision-language models (VLMs) to determine alignment with human preferences.

  • Provide Reward Signals: Outputs scalar reward scores reflecting the quality of generated content or its alignment with human preferences; a minimal ranking sketch follows this list.

  • Support Multimodal Tasks: Applicable to a wide range of multimodal tasks such as image captioning and complex reasoning.

  • Enhance Model Performance: Supports Mixed Preference Optimization (MPO) with high-quality preference data to significantly improve multimodal reasoning ability.
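
As a sketch of how these scalar reward signals can be consumed downstream, the snippet below ranks candidate responses for a single image and keeps the highest-scoring one (best-of-n selection). The score_response function is a hypothetical placeholder, not the released API; in practice it would call the Skywork-VL Reward checkpoint.

```python
from typing import List, Tuple

def score_response(image_path: str, prompt: str, response: str) -> float:
    """Hypothetical placeholder for a call to the Skywork-VL Reward model.
    A real implementation would run the checkpoint on the (image, prompt,
    response) triple and return its scalar reward score."""
    return 0.0  # dummy value so the sketch runs; replace with a real model call

def rank_candidates(image_path: str, prompt: str,
                    candidates: List[str]) -> List[Tuple[float, str]]:
    """Score every candidate and sort best-first: a simple best-of-n
    selection driven by the reward signal."""
    scored = [(score_response(image_path, prompt, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Usage: keep the caption the reward model prefers.
candidates = ["A cat sleeping on a sofa.", "A dog running in a park."]
best_score, best_caption = rank_candidates("example.jpg",
                                           "Describe the image.",
                                           candidates)[0]
```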


Technical Principles of Skywork-VL Reward

  • Model Architecture: Built upon the Qwen2.5-VL-7B-Instruct architecture, which includes a vision encoder (Vision Transformer), a vision-language adapter, and a language model decoder. A reward head is added to the base model to output scalar reward scores; it uses fully connected layers to process the final hidden states and generate reward values (a minimal sketch follows this list).

  • Dataset Construction: Integrates several open-source preference datasets (e.g., LLaVA-Critic-113k, Skywork-Reward-Preference-80K-v0.2, RLAIF-V-Dataset) along with internally annotated data for complex reasoning tasks. Data quality and consistency are ensured through deduplication, similarity filtering, and preference-based filtering (a simplified filtering sketch follows this list). High-quality preference data are generated using advanced VLM reasoning tools to enhance generalization.

  • Training Method: Trains the model with a pairwise preference loss that compares the relative quality of two candidate responses (the standard formulation is sketched after this list). A two-stage fine-tuning process is employed: the first stage trains on multimodal preference data, and the second stage incorporates pure-text preference data to further improve performance in text-only scenarios.
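
To make the architecture description concrete, here is a minimal PyTorch sketch of such a reward head: a fully connected layer that maps the decoder's final hidden state at the last non-padding token to a single scalar. The class name, the last-token pooling, and the 3584 hidden size (Qwen2.5-7B's dimension) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Illustrative reward head: projects the decoder's final hidden state
    to one scalar reward value per sequence."""

    def __init__(self, hidden_size: int = 3584):  # 3584 assumed (Qwen2.5-7B hidden size)
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden_size)
        # Pool the hidden state of the last non-padding token in each sequence.
        last_token_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(last_hidden_state.size(0))
        pooled = last_hidden_state[batch_idx, last_token_idx]  # (batch, hidden_size)
        return self.value_head(pooled).squeeze(-1)             # (batch,) scalar rewards
```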

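To picture the dataset construction step, the sketch below defines a simple pairwise preference record and applies deduplication plus a basic preference sanity filter. The field names and filter criteria are assumptions for illustration; the actual pipeline also relies on similarity filtering and VLM-assisted quality checks as described above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

@dataclass(frozen=True)
class PreferencePair:
    """One pairwise preference example: for the same image and prompt,
    `chosen` was preferred over `rejected`."""
    image_path: str
    prompt: str
    chosen: str
    rejected: str

def filter_pairs(pairs: Iterable[PreferencePair]) -> List[PreferencePair]:
    """Drop exact duplicates and degenerate pairs whose chosen and rejected
    responses are identical (preference-based filtering in its simplest form)."""
    seen: Set[PreferencePair] = set()
    kept: List[PreferencePair] = []
    for pair in pairs:
        if pair in seen:  # deduplication
            continue
        if pair.chosen.strip() == pair.rejected.strip():  # no usable preference signal
            continue
        seen.add(pair)
        kept.append(pair)
    return kept
```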

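The pairwise preference loss referenced in the training bullet is conventionally the Bradley-Terry objective -log sigmoid(r_chosen - r_rejected): minimizing it pushes the reward of the preferred response above that of the rejected one. The PyTorch sketch below shows this standard formulation; whether Skywork-VL Reward adds margins or auxiliary terms is not stated here, so treat it as an illustrative baseline rather than the exact training loss.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise (Bradley-Terry) preference loss:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example: rewards produced by the reward head for three (chosen, rejected) pairs.
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.5])
loss = pairwise_preference_loss(r_chosen, r_rejected)  # scalar training loss
```
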
Project Links for Skywork-VL Reward


Application Scenarios of Skywork-VL Reward

  • Content Generation Evaluation: Evaluates the quality of multimodal content generation such as image captions and video subtitles, determining whether the outputs are accurate and align with human preferences.

  • Reasoning Task Optimization: In complex multimodal reasoning tasks (e.g., visual question answering, geometric reasoning), assesses the validity of reasoning processes and outcomes to help optimize reasoning models.

  • Model Alignment: Ensures that outputs from multimodal models are aligned with human values and ethical standards, preventing harmful or misleading content.

  • Mixed Preference Optimization (MPO): Acts as a critical component in MPO training by providing high-quality preference data to improve the reasoning and generalization capabilities of multimodal models.

  • Benchmark Testing: Serves as a benchmark evaluation tool for multimodal tasks, enabling the assessment and comparison of different models and promoting the advancement of multimodal technology.
