UniTok – A Unified Visual Tokenizer Jointly Launched by ByteDance, the University of Hong Kong and Huazhong University of Science and Technology


What is UniTok?

UniTok is a unified visual tokenizer jointly developed by ByteDance, the University of Hong Kong, and Huazhong University of Science and Technology. It supports both visual generation and understanding tasks. Based on multi-codebook quantization, UniTok divides visual features into multiple segments, each quantized with an independent sub-codebook, dramatically enhancing the expressiveness of discrete tokens. This approach resolves the trade-off found in traditional tokenizers between capturing fine visual details and modeling high-level semantics.

UniTok achieves 78.6% zero-shot classification accuracy on ImageNet and a reconstruction FID (rFID) of just 0.38, significantly outperforming existing tokenizers. Multimodal large language models (MLLMs) built on top of UniTok excel in tasks like visual question answering and image generation, demonstrating strong potential in multimodal applications.


Key Features of UniTok

  • Unified Visual Representation:
    Converts images into discrete visual tokens that can be used in both image generation (e.g., text-to-image) and visual understanding tasks (e.g., visual question answering).

  • High-Quality Image Reconstruction:
    Efficiently reconstructs images while preserving fine visual details.

  • Semantic Alignment:
    Combines contrastive learning and reconstruction loss to align visual tokens with textual descriptions, improving semantic understanding.

  • Supports Multimodal Large Language Models (MLLMs):
    Serves as the visual input module for MLLMs, enabling unified processing and generation across vision and language modalities.


Technical Principles of UniTok

  • Multi-codebook Quantization:
    UniTok divides a visual feature vector (e.g., 64-dimensional) into smaller chunks (e.g., eight 8-dimensional segments). Each segment is quantized using a separate sub-codebook with 4,096 codewords. This grows the effective vocabulary exponentially with the number of sub-codebooks (4,096⁸ possible combinations in this configuration) and enhances the representational power of the discrete tokens (a minimal code sketch of this step follows this list).

  • Attention-based Decomposition:
    Replaces the traditional linear projection layer with multi-head attention modules for token decomposition, which better preserves the semantic information of the original features. UniTok uses causal attention to remain compatible with autoregressive generation tasks (also sketched after this list).

  • Unified Training Objective:
    Built on a VQ-VAE backbone, UniTok uses reconstruction loss to ensure fine detail preservation in image recovery. The loss function includes:

    • Pixel-level reconstruction error

    • Perceptual loss

    • Discriminator loss

    • Vector quantization loss

    Additionally, UniTok introduces a CLIP-style contrastive loss to align visual tokens with textual descriptions. The total loss is a weighted sum of the reconstruction and contrastive losses, allowing UniTok to optimize for both generation and understanding tasks simultaneously (see the loss sketch after this list).

  • MLLM Integration:
    The generated visual tokens are projected into the token space of a multimodal large language model using a multi-layer perceptron (MLP). To simplify MLLM inputs, the tokens from multiple sub-codebooks are merged into a single visual token. When generating visual outputs, the MLLM autoregressively predicts the sub-codebook tokens for the next visual token position, enabling efficient visual synthesis (see the projection sketch after this list).
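
To make the multi-codebook step concrete, here is a minimal PyTorch sketch using the sizes quoted above: a 64-dimensional feature split into eight 8-dimensional chunks, each matched against its own sub-codebook of 4,096 codewords. The class name, random codebook initialization, and nearest-neighbour lookup are illustrative assumptions; training details such as the straight-through gradient estimator and commitment loss are omitted.

```python
import torch
import torch.nn as nn

# Illustrative multi-codebook quantizer: one independent sub-codebook per chunk.
class MultiCodebookQuantizer(nn.Module):
    def __init__(self, dim=64, num_codebooks=8, codebook_size=4096):
        super().__init__()
        assert dim % num_codebooks == 0
        self.num_codebooks = num_codebooks
        self.chunk_dim = dim // num_codebooks            # 8 dims per chunk
        # One sub-codebook of 4,096 codewords for each chunk.
        self.codebooks = nn.Parameter(
            torch.randn(num_codebooks, codebook_size, self.chunk_dim))

    def forward(self, z):                                # z: (batch, tokens, dim)
        chunks = z.view(*z.shape[:-1], self.num_codebooks, self.chunk_dim)
        quantized, indices = [], []
        for i in range(self.num_codebooks):
            codebook = self.codebooks[i]                 # (codebook_size, chunk_dim)
            # Nearest codeword (squared L2) within this sub-codebook.
            dist = (chunks[..., i, :].unsqueeze(-2) - codebook).pow(2).sum(-1)
            idx = dist.argmin(dim=-1)                    # (batch, tokens)
            indices.append(idx)
            quantized.append(codebook[idx])
        # Re-assemble the quantized chunks into full-width token features.
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```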
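
The attention-based decomposition can be pictured as follows. This is a hedged sketch only: it assumes the decomposition amounts to a causally masked multi-head attention layer applied to the token sequence before the features are split into chunks, and the layer sizes are illustrative rather than UniTok's actual configuration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the linear projection: causal multi-head
# attention applied before the features are split into sub-codebook chunks.
class AttentionFactorization(nn.Module):
    def __init__(self, dim=64, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_chunks = num_heads

    def forward(self, x):                                # x: (batch, seq_len, dim)
        seq_len = x.size(1)
        # Causal mask: each token attends only to itself and earlier tokens,
        # keeping the factorization compatible with autoregressive generation.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        # Split each token's feature into one chunk per sub-codebook.
        return out.chunk(self.num_chunks, dim=-1)
```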
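
The unified objective reduces to a weighted sum of the four reconstruction terms and the CLIP-style contrastive term. The sketch below assumes a standard InfoNCE formulation for the contrastive part; the function names, weights, and temperature are placeholders, not UniTok's exact hyper-parameters.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric CLIP-style InfoNCE loss over matched image/text embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature         # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def unitok_total_loss(pixel, perceptual, gan, vq, img_emb, txt_emb,
                      w_recon=1.0, w_contrast=1.0):
    # Reconstruction side: pixel-level error, perceptual loss,
    # discriminator (GAN) loss, and vector-quantization loss.
    reconstruction = pixel + perceptual + gan + vq
    # Understanding side: alignment between visual tokens and text.
    contrastive = clip_contrastive_loss(img_emb, txt_emb)
    return w_recon * reconstruction + w_contrast * contrastive
```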
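
Finally, a rough sketch of the MLLM-side projection: the chunk embeddings produced by the sub-codebooks for one spatial position are concatenated and mapped by an MLP into the language model's embedding space, yielding a single visual token. The two-layer MLP and the 4,096-dimensional LLM hidden size are assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn

# Merges per-sub-codebook chunk embeddings into one MLLM input token.
class VisualTokenProjector(nn.Module):
    def __init__(self, num_codebooks=8, chunk_dim=8, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(num_codebooks * chunk_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, chunk_embeddings):
        # chunk_embeddings: (batch, tokens, num_codebooks, chunk_dim)
        merged = chunk_embeddings.flatten(-2)            # concatenate the chunks
        return self.proj(merged)                         # one embedding per visual token
```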


UniTok Project Links


Application Scenarios of UniTok

  • Visual Input for Multimodal Models:
    Acts as the visual module in MLLMs, enabling simultaneous processing of text and image data to improve overall performance.

  • High-Quality Image Generation:
    Generates detailed images based on text descriptions, useful for creative design, advertising, and more.

  • Visual Question Answering and Understanding:
    Helps models understand image content and answer vision-related questions, with applications in education, medical image analysis, and beyond.

  • Multimodal Content Creation:
    Quickly generates coherent text-image content for use in news, social media, and other creative domains, boosting content production efficiency.

  • Cross-Modal Retrieval and Recommendation:
    Facilitates search and recommendation based on either text or images, enhancing user experiences on e-commerce and multimedia platforms.
