What is UniPixel?
UniPixel is the first unified pixel-level multimodal large model, developed by The Hong Kong Polytechnic University and Tencent ARC Lab, focused on fine-grained understanding of and interaction with images and videos. Within a single model, it performs object grounding, pixel-level segmentation, and region-level reasoning. Through its object memory mechanism and unified visual encoding design, UniPixel achieves precise tracking and semantic understanding of objects in videos.
Built on the Qwen2.5-VL backbone, UniPixel accepts three types of visual prompts—points, boxes, and masks—and surpasses much larger 72B-parameter models on nine major visual benchmarks. Its key innovation is the deep integration of visual segmentation with language reasoning, addressing the limitations of prior models in handling complex referring expressions and dynamic region-level understanding.
Main Features of UniPixel
- Pixel-Level Vision-Language Understanding: Enables pixel-level alignment between visual signals and linguistic semantics, supporting fine-grained tasks such as image/video segmentation, regional understanding, and PixelQA.
- Unified Object Grounding and Segmentation: Seamlessly integrates object grounding and segmentation. Based on visual prompts, the model generates the relevant masks and reasons over these intermediate pointers, enabling fine-grained pixel-level inference.
- Multi-Task Capability: Excels across multiple benchmarks, including ReVOS, MeViS, Ref-YouTube-VOS, and RefCOCO/+/g, and introduces a new PixelQA task that combines object grounding, segmentation, and question answering.
- Flexible Visual Prompt Handling: Processes visual prompts (points, boxes, or masks) to generate segmentation masks and reason over them, supporting single-frame and multi-frame video understanding as well as mask-based QA; a minimal sketch of such prompts follows this list.
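The visual prompts above can be thought of as lightweight geometric annotations attached to a text query. The following is a minimal illustrative sketch, not the actual UniPixel API, of how point, box, and mask prompts might be represented and bundled with a question; all class and field names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Union
import numpy as np

# Hypothetical prompt types -- illustrative only, not the UniPixel API.

@dataclass
class PointPrompt:
    x: float            # pixel coordinates of the clicked point
    y: float
    frame_idx: int = 0  # which video frame the point refers to

@dataclass
class BoxPrompt:
    x1: float           # top-left corner
    y1: float
    x2: float           # bottom-right corner
    y2: float
    frame_idx: int = 0

@dataclass
class MaskPrompt:
    mask: np.ndarray    # binary HxW mask marking the referred region
    frame_idx: int = 0

VisualPrompt = Union[PointPrompt, BoxPrompt, MaskPrompt]

@dataclass
class PixelQuery:
    """A text query grounded by zero or more visual prompts."""
    question: str
    prompts: List[VisualPrompt]
    frames: Optional[List[np.ndarray]] = None  # RGB video frames, or a single image

# Example: ask about the object under a clicked point in frame 0.
query = PixelQuery(
    question="What is the person at this point holding?",
    prompts=[PointPrompt(x=320.0, y=180.0, frame_idx=0)],
)
```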
Technical Principles of UniPixel
- Unified Framework Design: UniPixel integrates object grounding and segmentation into a single architecture, bridging coarse scene understanding and fine-grained pixel reasoning and providing a foundation for complex visual reasoning.
- Object Memory Bank: An object memory module stores features extracted during grounding and supplies contextual information to subsequent segmentation and reasoning steps, improving pixel-level understanding; a minimal sketch of this idea follows the list.
- Multi-Stage Training Strategy: Combines pretraining, grounding fine-tuning, and segmentation fine-tuning to progressively strengthen pixel-level performance and adapt the model to different applications; an illustrative schedule is also shown after the list.
- End-to-End Mask Generation: Directly generates pixel-level masks from natural language descriptions, achieving a deep fusion of language and vision for tasks such as image/video segmentation and regional understanding.
- Strong Reasoning Capabilities: On the VideoRefer-Bench-Q QA benchmark, UniPixel-7B reaches 74.1% accuracy, surpassing strong baselines including GPT-4o and demonstrating its capability in complex visual reasoning.
- Model Weights and Datasets: Provides both UniPixel-3B and UniPixel-7B model weights, along with 23 datasets covering grounding, segmentation, and QA tasks, including raw images/videos and preprocessed annotations for research and application.
- Training and Evaluation Support: The codebase supports training and evaluation across 23 datasets and benchmarks, flexible hardware configurations, efficient training techniques, customizable LLM backbones and dialogue templates, and progress monitoring via TensorBoard/Wandb.
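The object memory bank described above can be pictured as a keyed store of per-object features that later segmentation and QA steps retrieve for context. Below is a minimal sketch of that idea under our own simplifying assumptions; the names (`ObjectMemoryBank`, `write`, `read`) are hypothetical and do not mirror UniPixel's internal code.

```python
from typing import Dict, List
import numpy as np

class ObjectMemoryBank:
    """Illustrative per-object feature store (hypothetical, not UniPixel's implementation).

    Grounding produces feature vectors for each referred object; later
    segmentation / QA steps look those features up by object id so that
    reasoning stays consistent across frames.
    """

    def __init__(self) -> None:
        self._store: Dict[int, List[np.ndarray]] = {}

    def write(self, obj_id: int, feature: np.ndarray) -> None:
        # Append the feature observed for this object (e.g. one per frame).
        self._store.setdefault(obj_id, []).append(feature)

    def read(self, obj_id: int) -> np.ndarray:
        # Aggregate the stored features, here by simple averaging.
        feats = self._store[obj_id]
        return np.mean(np.stack(feats, axis=0), axis=0)

    def known_objects(self) -> List[int]:
        return list(self._store.keys())

# Usage: grounding writes features for object 0 over two frames,
# and a later segmentation/QA step reads the aggregated memory.
bank = ObjectMemoryBank()
bank.write(0, np.random.randn(256))
bank.write(0, np.random.randn(256))
context_feature = bank.read(0)   # shape (256,)
```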
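The multi-stage training strategy can likewise be summarized as an ordered schedule of stages, each with its own data mix and trainable components. The configuration below is purely illustrative; the stage contents and settings are assumptions, not UniPixel's published recipe.

```python
# Illustrative three-stage schedule (hypothetical values, not the official recipe).
TRAINING_STAGES = [
    {
        "name": "pretraining",
        "goal": "align visual features with the language model's token space",
        "data": ["image-text pairs", "video-text pairs"],
        "trainable": ["projector"],  # e.g. keep the LLM and vision encoder frozen
    },
    {
        "name": "grounding_finetune",
        "goal": "localize objects from text and visual prompts",
        "data": ["referring expression grounding", "region captioning"],
        "trainable": ["projector", "llm"],
    },
    {
        "name": "segmentation_finetune",
        "goal": "produce pixel-level masks and reason over them (PixelQA)",
        "data": ["referring image/video segmentation", "PixelQA"],
        "trainable": ["projector", "llm", "mask_decoder"],
    },
]

def run_schedule(stages):
    """Walk the stages in order; each stage would invoke the real training loop."""
    for stage in stages:
        print(f"Stage {stage['name']}: training {stage['trainable']} on {stage['data']}")

run_schedule(TRAINING_STAGES)
```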
Project Links
- Official Website: https://polyu-chenlab.github.io/unipixel/
- GitHub Repository: https://github.com/PolyU-ChenLab/UniPixel
- HuggingFace Dataset: https://huggingface.co/datasets/PolyU-ChenLab/UniPixel-SFT-1M
- arXiv Paper: https://arxiv.org/pdf/2509.18094
- Online Demo: https://huggingface.co/spaces/PolyU-ChenLab/UniPixel
Application Scenarios
- Image Segmentation: Generates pixel-level masks for specific objects from natural language descriptions, applicable to tasks such as medical image analysis and object segmentation in autonomous driving.
- Video Segmentation: Performs real-time object segmentation in videos, suitable for applications such as video editing, surveillance, and augmented reality.
- Regional Understanding: Identifies and segments specific regions in videos based on language descriptions, enabling video content analysis, intelligent surveillance, and background segmentation in video conferencing.
- Question Answering (PixelQA): Supports question answering that combines visual and linguistic cues, applicable to education, intelligent assistants, and information retrieval; an end-to-end sketch follows this list.
- Multimodal Interaction: Provides natural and precise interaction between visual and linguistic inputs, useful in intelligent assistants, virtual reality, and game development.
- Intelligent Surveillance: Recognizes and segments specific objects or regions in real-time surveillance footage, enhancing the intelligence and automation of monitoring systems.
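To make the PixelQA scenario concrete, here is a hypothetical end-to-end usage sketch: a single call takes video frames, a point prompt, and a question, and returns both a text answer and per-frame masks. The `load_model` / `answer` interface and the stub class are invented for illustration; consult the GitHub repository for the actual inference API.

```python
import numpy as np

# Hypothetical inference wrapper -- the real API lives in the UniPixel repo.
class UniPixelStub:
    """Stand-in that mimics the expected inputs/outputs of a pixel-level QA call."""

    def answer(self, frames, question, point):
        # A real model would ground the clicked point, segment the object
        # across frames, and condition the language answer on those masks.
        masks = [np.zeros(f.shape[:2], dtype=bool) for f in frames]
        return {"answer": "a red umbrella", "masks": masks}

def load_model():
    # Placeholder for loading UniPixel-3B/7B weights (see the GitHub repo).
    return UniPixelStub()

frames = [np.zeros((480, 854, 3), dtype=np.uint8) for _ in range(8)]  # dummy video
model = load_model()
result = model.answer(
    frames,
    question="What is the person at this point carrying?",
    point=(320, 180),  # (x, y) click on frame 0
)
print(result["answer"])           # text answer
print(result["masks"][0].shape)   # per-frame segmentation mask, (480, 854)
```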