PixelFlow – An Image Generation Model Jointly Launched by HKU and Adobe
What is PixelFlow?
PixelFlow is an image generation model jointly developed by the University of Hong Kong and Adobe that generates images directly in pixel space, without a pre-trained VAE or other latent-space components. It is built on efficient cascaded flow modeling, progressively scaling up from low resolution to high resolution to reduce computational cost. On the 256×256 ImageNet class-conditional generation benchmark, PixelFlow achieves an FID of 1.98, demonstrating strong image quality and semantic control. It also performs well on text-to-image generation, producing high-quality images that closely match their textual descriptions. Its end-to-end trainability and efficient multi-scale generation strategy point to new research directions for the next generation of visual generation models.

The main functions of PixelFlow
- High-quality Image Generation: Supports the generation of high-resolution and high-quality images.
- Class-Conditional Image Generation: Generates corresponding images based on given class labels.
- Text-to-Image Generation: Generates images that match the text description, supporting complex semantic understanding and visual representation.
The technical principles of PixelFlow
- Flow Matching: Flow matching is a generative modeling technique that progressively transforms samples from a prior distribution (such as the standard normal distribution) into samples from the target data distribution along straight-line paths. During training, intermediate samples are constructed by linear interpolation between prior and data samples, and the model is trained to predict the velocity that transports each intermediate sample toward the real data sample.
- Multi-scale Generation: This approach gradually increases the image resolution through a multi-stage denoising process. Each stage starts with a low-resolution noisy image and progressively denoises and upscales the resolution until the target resolution is reached. The step-by-step resolution enhancement avoids performing all denoising steps at full resolution, significantly reducing computational costs.
- Transformer Architecture:
◦ Patchify: Converts the spatial representation of the input image into a 1D token sequence.
◦ RoPE (Rotary Position Embedding): Replaces the original sine-cosine position encoding to better handle different image resolutions.
◦ Resolution Embedding: Introduces additional resolution embeddings to distinguish between different resolutions.
◦ Cross-Attention for Text-to-Image: Introduces cross-attention layers in each Transformer block to align visual features with the text input.
- End-to-end Training: Trains directly in pixel space with a single, unified set of parameters, eliminating the need for a pre-trained VAE or other auxiliary networks. During training, the model uniformly samples training examples across all resolution stages and uses sequence packing for joint training, improving training efficiency and model scalability.
- Efficient Inference Strategy: During inference, PixelFlow starts from Gaussian noise at the lowest resolution and progressively denoises and upscales the resolution until the target resolution is reached. It supports various ODE solvers (such as Euler and Dopri5), allowing you to choose different solvers as needed to balance speed and generation quality.
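The flow-matching recipe described above (linear interpolation toward data, velocity as the regression target) can be sketched in a few lines of NumPy. This is an illustrative toy, not PixelFlow's actual code; `make_training_pair` is a hypothetical helper name, and the "oracle" velocity stands in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_pair(x1, rng):
    """Build one flow-matching training example (illustrative).

    x0 is drawn from the prior (standard normal); xt lies on the
    straight line between x0 and the data sample x1; the regression
    target is the constant velocity of that line, x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # prior sample
    t = rng.uniform()                    # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1         # linear interpolation
    v_target = x1 - x0                   # velocity the model must predict
    return xt, t, v_target

x1 = rng.standard_normal(4)              # stand-in "data" sample
xt, t, v_target = make_training_pair(x1, rng)

# Sanity check: integrating the exact (oracle) velocity field with
# Euler steps carries a prior sample exactly onto the data sample.
x0 = rng.standard_normal(4)
x = x0.copy()
steps = 10
for _ in range(steps):
    v = x1 - x0               # oracle velocity (a trained model approximates this)
    x = x + v / steps         # Euler update
print(np.allclose(x, x1))     # True
```

Because the interpolation paths are straight lines, the target velocity is constant along each path, which is what makes the Euler integration above land exactly on the data sample.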
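The cascaded inference loop (start from noise at the lowest resolution, alternate denoising and upsampling until the target resolution) can likewise be sketched. Everything here is an assumption for illustration: the function names are hypothetical, the velocity field is a toy placeholder for the trained network, and nearest-neighbour upsampling stands in for whatever resizing the model actually uses:

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling (stand-in for the real resizing)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def denoise_stage(x, velocity_fn, steps):
    """Run fixed-step Euler ODE integration at one resolution."""
    for i in range(steps):
        t = i / steps
        x = x + velocity_fn(x, t) / steps
    return x

def cascade_sample(target_res=64, base_res=8, steps_per_stage=8, seed=0):
    """Illustrative PixelFlow-style cascade: begin with Gaussian noise
    at the lowest resolution, then alternately denoise and upsample
    until target_res is reached."""
    rng = np.random.default_rng(seed)
    velocity_fn = lambda x, t: -x   # toy field contracting toward zero,
                                    # NOT the trained PixelFlow network
    x = rng.standard_normal((base_res, base_res))
    res = base_res
    while True:
        x = denoise_stage(x, velocity_fn, steps_per_stage)
        if res == target_res:
            return x
        x = upsample2x(x)
        res *= 2

img = cascade_sample()
print(img.shape)   # (64, 64)
```

The computational saving comes from the `while` loop's structure: most Euler steps run on small grids (8×8, 16×16, …), and only the final stage pays the cost of full-resolution updates.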
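The patchify step can be made concrete with a small NumPy sketch: a (C, H, W) image is cut into non-overlapping p×p patches, each flattened into one token. The function name and layout are illustrative assumptions, not PixelFlow's actual implementation:

```python
import numpy as np

def patchify(img, p):
    """Split a (C, H, W) image into a 1D sequence of flattened p*p
    patches: (H//p * W//p) tokens, each of dimension C*p*p.
    Illustrative sketch only."""
    C, H, W = img.shape
    assert H % p == 0 and W % p == 0, "resolution must be divisible by patch size"
    x = img.reshape(C, H // p, p, W // p, p)
    x = x.transpose(1, 3, 0, 2, 4)           # (H//p, W//p, C, p, p)
    return x.reshape((H // p) * (W // p), C * p * p)

tokens = patchify(np.zeros((3, 32, 32)), p=2)
print(tokens.shape)   # (256, 12): 16*16 patches, each 3*2*2 values
```

Note how the token count grows quadratically with resolution, which is another reason the cascade performs most of its work at low resolution.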
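For intuition on RoPE, here is a minimal 1D sketch of the standard rotating-pairs formulation: channel pairs are rotated by position-dependent angles, so relative offsets are encoded without a fixed-size position table, which helps generalize across resolutions. This is the generic textbook form, simplified to 1D positions; PixelFlow's application to 2D image tokens may differ in detail:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply 1D rotary position embedding to a (seq, dim) array.
    Channel pairs (x1_i, x2_i) are rotated by angle pos * freq_i,
    so attention scores depend on relative positions."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair frequencies
    angles = pos[:, None] * freqs[None, :]      # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=1)

x = np.random.default_rng(1).standard_normal((5, 8))
y = rope_rotate(x, pos=np.arange(5))

# Rotations preserve vector norms, so RoPE reshapes phases, not magnitudes.
print(np.allclose(np.linalg.norm(y, axis=1), np.linalg.norm(x, axis=1)))  # True
```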
The project address of PixelFlow
- GitHub Repository: https://github.com/ShoufaChen/PixelFlow
- arXiv Research Paper: https://arxiv.org/pdf/2504.07963
- Online Demo Experience: https://huggingface.co/spaces/ShoufaChen/PixelFlow
Application scenarios of PixelFlow
- Art and Design: Generate creative paintings, graphic design elements, and virtual characters.
- Content Creation: Assist in video production, game development, and social media content creation.
- Education and Research: Serve as a teaching tool to help understand complex concepts and assist in scientific research visualization.
- Business and Marketing: Generate product design prototypes, advertising images, and brand promotion content.
- Entertainment and Interaction: Used in interactive stories, VR/AR content generation, and personalized image customization.