Z-Image – An Image Generation Model Developed by Alibaba’s Tongyi
What is Z-Image?
Z-Image is an image generation model developed by Alibaba’s Tongyi team, featuring 6B parameters. The model comes in three variants—Z-Image-Turbo, Z-Image-Base, and Z-Image-Edit—each optimized for fast inference, foundation development, and image editing respectively. Built on a single-stream DiT architecture, it supports bilingual (Chinese and English) text rendering and can generate or edit high-quality images based on natural language instructions. With decoupled DMD and DMDR technologies, Z-Image delivers strong performance and image quality, making it suitable for a wide range of creative applications.

Main Features of Z-Image
Efficient Image Generation
Z-Image can rapidly generate high-quality and realistic images, applicable to multiple scenarios such as creative design, artistic production, and virtual content creation.
Bilingual Text Rendering
It supports both Chinese and English text rendering and can accurately generate images that contain complex textual elements, making it useful in multilingual image generation tasks.
Creative Image Editing
Through the Z-Image-Edit variant, users can precisely edit images based on natural language instructions, enabling creative transformations and stylistic adjustments.
Low-Resource Adaptation
The Z-Image-Turbo version is optimized for inference efficiency and can run quickly on low-resource devices (e.g., consumer-grade GPUs), making it suitable for both enterprise and consumer applications.
Community-Driven Development
It provides a foundation model (Z-Image-Base) that enables developers to perform fine-tuning and custom development to meet diverse needs.
Technical Principles of Z-Image
Single-Stream Diffusion Transformer Architecture (S3-DiT)
Z-Image uses a single-stream diffusion transformer that concatenates text tokens, visual semantic tokens, and image VAE tokens into one unified input sequence processed by a single transformer. Compared with dual-stream architectures, this greatly improves parameter efficiency and reduces computation cost.
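The single-stream idea can be sketched in a few lines: rather than routing each modality through its own branch, all token sequences are concatenated into one stream that a single transformer attends over. This is a minimal illustrative sketch only; the token contents, lengths, and tagging scheme below are made up and do not reflect Z-Image's actual tokenization.

```python
# Minimal sketch of a single-stream input: three token sequences are
# concatenated into one sequence for a single transformer to process.
# All values here are illustrative stand-ins, not real model tokens.

def build_unified_stream(text_tokens, semantic_tokens, vae_tokens):
    """Concatenate all modalities into one sequence, tagging each token
    with its source modality so that information is not lost."""
    stream = []
    for modality, tokens in (
        ("text", text_tokens),
        ("semantic", semantic_tokens),
        ("vae", vae_tokens),
    ):
        stream.extend((modality, tok) for tok in tokens)
    return stream

text = ["a", "red", "lantern"]     # prompt tokens (stand-ins)
semantic = [0.1, 0.5]              # visual-semantic embeddings (stand-ins)
vae = [0.2, 0.4, 0.6, 0.8]         # latent image patches (stand-ins)

stream = build_unified_stream(text, semantic, vae)
print(len(stream))  # 9 tokens in a single sequence
```

Because every token lives in the same sequence, one set of transformer weights serves all modalities, which is where the parameter-efficiency advantage over dual-stream designs comes from.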
Decoupled DMD (Distribution Matching Distillation)
With decoupled DMD, the CFG Augmentation (CA) and Distribution Matching (DM) mechanisms are separated and optimized independently. This significantly boosts performance in few-step generation, enabling highly efficient image production.
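The CFG Augmentation term refers to classifier-free guidance; how Z-Image decouples it from distribution matching during distillation is detailed in the technical report. For reference, the standard CFG combination step that this component builds on looks like the following sketch (toy score vectors, plain Python):

```python
# Standard classifier-free guidance (CFG) combination step on toy
# score vectors. This only illustrates the guidance formula itself,
# not Z-Image's decoupled-DMD training procedure.

def cfg_combine(uncond, cond, scale):
    """guided = uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond_score = [0.0, 1.0]   # model output without the prompt
cond_score = [0.5, 0.5]     # model output conditioned on the prompt
guided = cfg_combine(uncond_score, cond_score, scale=2.0)
print(guided)  # [1.0, 0.0]
```

With `scale=1.0` the formula reduces to the conditional output; larger scales push the result further toward the prompt-conditioned direction, which is what the "augmentation" strengthens.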
DMDR (DMD + Reinforcement Learning)
By integrating reinforcement learning (RL) with distribution matching distillation (DMD), Z-Image further enhances semantic alignment, aesthetic quality, and structural coherence, resulting in higher-quality images.
Optimized Inference Performance
The model supports technologies such as Flash Attention and model compilation, further accelerating inference, reducing latency, and improving efficiency in real-world applications.
Multilingual Understanding and Generation
Through multimodal pretraining and fine-tuning, Z-Image can understand and generate images containing both Chinese and English text, supporting cross-language image creation tasks.
Z-Image Project Links
- Official Website: https://tongyi-mai.github.io/Z-Image-homepage/
- GitHub Repository: https://github.com/Tongyi-MAI/Z-Image
- HuggingFace Model Hub: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
- Technical Report: https://github.com/Tongyi-MAI/Z-Image/blob/main/Z_Image_Report.pdf
Application Scenarios of Z-Image
Art Creation
Artists can use Z-Image to generate unique artworks and explore different styles and themes.
Advertising Material Generation
Quickly produce high-quality promotional images for social media, posters, banners, and more.
Film & VFX
The model can create virtual scenes, characters, and visual effects elements to support film production.
Game Development
Generate characters, environments, and props rapidly to speed up game development workflows.
Educational Materials
Produce images related to educational content—such as historical scenes or scientific concepts—to enhance teaching effectiveness.