GigaTok – HKU and ByteDance Jointly Launch a Visual Tokenizer for Autoregressive Image Generation
What is GigaTok
GigaTok is a visual tokenizer for autoregressive image generation, developed jointly by the University of Hong Kong (HKU) and ByteDance and scaled to 3 billion parameters. It addresses the trade-off between reconstruction and generation quality through semantic regularization, which aligns tokenizer features with semantically consistent features from pre-trained visual encoders such as DINOv2 and thereby constrains latent-space complexity as the model scales. GigaTok adopts a 1D tokenizer architecture for better scalability, prioritizes decoder expansion for efficient resource allocation, and introduces an entropy loss to stabilize training at large scale.
Key Features of GigaTok
- High-Quality Image Reconstruction: GigaTok scales visual tokenizers to 3 billion parameters, significantly improving image reconstruction quality. Semantic regularization aligns tokenizer features with features from pre-trained visual encoders (e.g., DINOv2) to prevent excessive latent-space complexity during scaling.
- Enhanced Downstream Generation Performance: GigaTok excels in downstream autoregressive generation tasks, resolving the traditional trade-off between reconstruction and generation quality through semantic regularization and optimized scaling strategies.
- Optimized Representation Learning: By expanding the visual tokenizer and incorporating semantic regularization, GigaTok enhances representation learning quality in downstream autoregressive models, achieving notable improvements in linear probing accuracy.
- Innovative Scaling Strategies: GigaTok introduces a 1D tokenizer architecture for better scalability, prioritizes decoder expansion to efficiently allocate computational resources, and employs entropy loss to stabilize training in large-scale models.
Technical Principles of GigaTok
- Hybrid Architecture Design: GigaTok adopts a hybrid architecture combining CNN and Transformer components for efficient feature extraction and latent-space encoding. The encoder uses CNN blocks for progressive downsampling, followed by Transformer layers and vector quantization to produce discrete latent codes. The decoder reconstructs images from these codes with Transformer layers followed by CNN upsampling blocks. Both 1D and 2D tokenizers are supported, with 1D tokenizers offering superior scalability (see the architecture sketch after this list).
- Semantic Regularization: To manage latent-space complexity during scaling, GigaTok introduces semantic regularization, aligning tokenizer features with semantically consistent features from pre-trained visual encoders like DINOv2. The alignment is enforced through an auxiliary regularization loss during tokenizer training, so generation quality is maintained as the model scales (a loss sketch follows this list).
- Asymmetric Scaling Strategy: GigaTok prioritizes decoder expansion over encoder expansion during scaling, allocating computational resources more efficiently and preventing the latent space from becoming uncontrollably complex due to an overly large encoder (an illustrative configuration follows this list).
- Entropy Loss: To stabilize training of large-scale tokenizers, GigaTok incorporates an entropy loss that encourages higher codebook utilization, keeping training stable as the model grows (an entropy-loss sketch follows this list).
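To make the hybrid design above concrete, here is a minimal PyTorch-style sketch of a 2D-grid variant: CNN blocks downsample the image, Transformer layers refine the tokens, a vector quantizer produces discrete codes, and a mirrored decoder reconstructs the image. All layer sizes, module names, and the simple nearest-neighbor quantizer are illustrative assumptions, not the released GigaTok implementation; a 1D variant would replace the spatial grid with a fixed set of learned latent tokens.

```python
# Minimal sketch of a hybrid CNN + Transformer tokenizer with vector quantization.
# Layer sizes and names are assumptions for illustration, not the GigaTok release.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                  # z: (B, N, D)
        # Nearest codebook entry per token (straight-through trick omitted).
        d2 = (z.pow(2).sum(-1, keepdim=True)
              - 2 * z @ self.codebook.weight.t()
              + self.codebook.weight.pow(2).sum(-1))
        codes = d2.argmin(dim=-1)                          # discrete latent codes
        return self.codebook(codes), codes

class HybridTokenizer(nn.Module):
    def __init__(self, dim=256, depth=4):
        super().__init__()
        # CNN blocks progressively downsample the image (e.g. 256x256 -> 16x16).
        self.cnn_encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, 4, stride=4), nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder_tf = nn.TransformerEncoder(layer, num_layers=depth)
        self.quantizer = VectorQuantizer(dim=dim)
        # Decoder mirrors the encoder: Transformer layers, then CNN upsampling.
        self.decoder_tf = nn.TransformerEncoder(layer, num_layers=depth)
        self.cnn_decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=4), nn.GELU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=4),
        )

    def forward(self, img):                                # img: (B, 3, H, W)
        feat = self.cnn_encoder(img)                       # (B, D, h, w)
        b, d, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)           # (B, h*w, D)
        z_q, codes = self.quantizer(self.encoder_tf(tokens))
        dec = self.decoder_tf(z_q).transpose(1, 2).reshape(b, d, h, w)
        return self.cnn_decoder(dec), codes
```

In such a sketch the single `depth` value could be split into separate encoder and decoder depths, which is where the asymmetric scaling strategy discussed above comes in.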
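A plausible form of the semantic regularization term is a feature-alignment loss between intermediate tokenizer features and frozen DINOv2 patch features. The projection head, the cosine-similarity objective, and the choice of which features to align are assumptions in this sketch, not details confirmed by the article.

```python
# Sketch of a semantic-regularization loss: pull (projected) tokenizer features
# toward frozen DINOv2 features. The cosine objective and the projection head
# are assumptions made for illustration.
import torch.nn.functional as F

def semantic_reg_loss(tok_feats, dino_feats, proj):
    """tok_feats:  (B, N, D_tok)  intermediate tokenizer features
    dino_feats: (B, N, D_dino) patch features from a frozen DINOv2 encoder
    proj:       learned nn.Linear mapping D_tok -> D_dino"""
    pred = proj(tok_feats)
    # 1 - mean cosine similarity: minimizing it aligns the two feature sets.
    return 1.0 - F.cosine_similarity(pred, dino_feats.detach(), dim=-1).mean()
```

The overall training objective would then add a weighted term such as `loss = recon_loss + vq_loss + lambda_sem * semantic_reg_loss(...)`, where the weight `lambda_sem` is a tunable hyperparameter in this sketch.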
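The asymmetric scaling strategy can be pictured as a configuration choice: as the tokenizer grows, most of the added depth and width goes to the decoder while the encoder stays comparatively small. The specific numbers below are hypothetical and only illustrate the idea.

```python
# Hypothetical tokenizer configurations illustrating decoder-heavy scaling.
from dataclasses import dataclass

@dataclass
class TokenizerConfig:
    enc_layers: int
    enc_dim: int
    dec_layers: int
    dec_dim: int

# Scaling up: the encoder grows modestly while the decoder takes most of the
# extra capacity, keeping the latent space the encoder produces from becoming
# uncontrollably complex.
SMALL = TokenizerConfig(enc_layers=6, enc_dim=512, dec_layers=6,  dec_dim=512)
LARGE = TokenizerConfig(enc_layers=8, enc_dim=768, dec_layers=24, dec_dim=1536)
```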
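Finally, the entropy loss can be sketched as a penalty on low entropy of the average code-assignment distribution, which pushes usage to spread across the codebook. The soft assignments via a softmax over negative distances, the temperature, and the batch-level averaging are all assumptions here, not the paper's exact formulation.

```python
# Sketch of an entropy term that encourages high codebook utilization.
# The soft-assignment form and temperature are illustrative assumptions.
import torch

def codebook_entropy_loss(neg_dist, temperature=1.0):
    """neg_dist: (B*N, K) negative token-to-codebook distances (higher = closer)."""
    probs = torch.softmax(neg_dist / temperature, dim=-1)    # soft assignments
    avg_probs = probs.mean(dim=0)                             # batch-level usage
    entropy = -(avg_probs * (avg_probs + 1e-8).log()).sum()
    # High entropy of average usage is desired, so return its negative to minimize.
    return -entropy
```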
Project Links for GigaTok
- Project Website: https://silentview.github.io/GigaTok/
- GitHub Repository: https://github.com/SilentView/GigaTok
- arXiv Technical Paper: https://arxiv.org/pdf/2504.08736
Application Scenarios for GigaTok
- Image Generation and Synthesis: GigaTok demonstrates exceptional performance in autoregressive image generation, producing high-quality images suitable for art creation, game development, virtual reality, and other fields requiring rapid generation of tailored visual content.
- Image Editing and Enhancement: GigaTok can be applied to image editing tasks, such as seamlessly integrating foreground objects into background images.
- Data Augmentation and Pretraining: With its efficient image tokenization and reconstruction capabilities, GigaTok provides high-quality pretraining data for machine learning models.
- Multimodal Learning: GigaTok’s semantic regularization enables integration with text generation models, facilitating text-to-image generation for applications in intelligent creation and virtual assistants.
- Medical Image Processing: GigaTok’s high-fidelity image reconstruction capabilities can be utilized in medical image generation and processing, such as producing high-quality medical images for diagnostics or research.