Qwen-Image-Edit – A versatile image editing model launched by Alibaba Tongyi

What is Qwen-Image-Edit?

Qwen-Image-Edit is a versatile image editing model built on the 20-billion-parameter Qwen-Image model. It combines two kinds of editing: low-level appearance edits (such as adding, deleting, or modifying elements) and high-level semantic edits (such as IP (original character) creation, object rotation, and style transfer). It also supports precise bilingual (Chinese and English) text editing in images, changing the text while preserving the original font, size, and style. Qwen-Image-Edit achieves state-of-the-art (SOTA) results on multiple public benchmarks and can be tried via Qwen Chat.

Qwen-Image-Edit – Key Features

Semantic Editing: Modify image content, for example through IP creation, object rotation, or style transfer, while preserving the overall visual semantics.
Appearance Editing: Precisely edit specific regions of an image, such as adding, deleting, or modifying elements, while keeping other areas unchanged.
Accurate Text Editing: Supports bilingual (Chinese and English) text editing in images, including adding, deleting, or modifying text, while retaining the original font, size, and style.
Strong Benchmark Performance: Achieves state-of-the-art (SOTA) results on multiple public image editing benchmarks, including complex editing tasks.

Technical Principles of Qwen-Image-Edit

Model Architecture: Qwen-Image-Edit is obtained by further training the 20B-parameter Qwen-Image model, and it inherits that model's strong text rendering and image generation capabilities. Each input image is fed to two modules at the same time:
- Qwen2.5-VL: Responsible for semantic control, understanding the semantic content of the image and enabling semantic-level edits.
- VAE Encoder: Responsible for appearance control, accurately handling visual details and enabling localized edits.
Semantic and Appearance Editing: Through Qwen2.5-VL, the model ensures semantic consistency while modifying content. Through the VAE Encoder, it handles fine-grained visual edits such as adding, removing, or modifying localized elements.
Text Editing: Optimized for text rendering, the model can accurately detect and edit text in images, supporting bilingual (Chinese and English) operations while preserving font, size, and style.
Chained Editing: Supports step-by-step refinement, so complex edits can be built up through progressive adjustments. Users can mark the areas to modify and apply successive edits, each starting from the previous result, until the desired outcome is reached.
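
The chained editing workflow described above can be sketched as a simple loop in which each edit takes the previous output as its input. This is a minimal illustration, not the model's own implementation: chain_edits and fake_edit are hypothetical names, and fake_edit is a stand-in that records prompts instead of calling a real editing model.

```python
from typing import Callable, List, Sequence, TypeVar

ImageT = TypeVar("ImageT")


def chain_edits(
    edit_fn: Callable[[ImageT, str], ImageT],
    image: ImageT,
    prompts: Sequence[str],
) -> List[ImageT]:
    """Apply prompts one after another; each edit starts from the previous result.

    Returns the original image followed by every intermediate result,
    so any single step can be inspected or redone.
    """
    history = [image]
    for prompt in prompts:
        history.append(edit_fn(history[-1], prompt))
    return history


def fake_edit(image: List[str], prompt: str) -> List[str]:
    """Stand-in for a real editing call: records the prompt instead of editing pixels."""
    return image + [prompt]


# Two successive corrections, each applied to the output of the one before.
steps = chain_edits(
    fake_edit,
    [],
    ["fix the character in the first marked box", "fix the character in the second marked box"],
)
print(len(steps))   # original image plus one result per prompt
print(steps[-1])    # final result after both edits
```

In practice, edit_fn would wrap a call to an actual editing model. Keeping the full history makes it easy to return to an earlier step if one edit goes wrong.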

Project Links for Qwen-Image-Edit

Official Blog Post: https://qwenlm.github.io/blog/qwen-image-edit/
GitHub Repository: https://github.com/QwenLM/Qwen-Image
HuggingFace Model Hub: https://huggingface.co/Qwen/Qwen-Image-Edit
Online Demo: https://huggingface.co/spaces/Qwen/Qwen-Image-Edit
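
For readers who want to try the Hugging Face checkpoint listed above, the following is a minimal sketch based on the usage published on the model's Hugging Face page at the time of writing. It assumes a recent diffusers release that provides QwenImageEditPipeline, a CUDA GPU with enough memory for the 20B model, and hypothetical file names. The helpers build_edit_kwargs and run_edit are illustrative names, not library functions, and the parameter values are example settings rather than tuned recommendations. Check the current diffusers documentation before relying on any of them.

```python
def build_edit_kwargs(image, prompt, steps=50, true_cfg_scale=4.0, seed=None):
    """Collect keyword arguments for one pipeline call (illustrative helper)."""
    kwargs = {
        "image": image,
        "prompt": prompt,
        "negative_prompt": " ",
        "true_cfg_scale": true_cfg_scale,
        "num_inference_steps": steps,
    }
    if seed is not None:
        import torch  # only needed when a fixed seed is requested

        kwargs["generator"] = torch.manual_seed(seed)
    return kwargs


def run_edit(image_path, prompt, out_path, seed=0):
    """Load the pipeline, edit one image, and save the result."""
    # Heavy imports stay inside the function so that defining these helpers
    # does not require torch, Pillow, or diffusers to be installed.
    import torch
    from PIL import Image
    from diffusers import QwenImageEditPipeline

    pipe = QwenImageEditPipeline.from_pretrained(
        "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    image = Image.open(image_path).convert("RGB")
    with torch.inference_mode():
        result = pipe(**build_edit_kwargs(image, prompt, seed=seed))
    result.images[0].save(out_path)


# Example call (hypothetical file names; requires a GPU and the model weights):
# run_edit("input.png", "Change the sign text to 'OPEN'", "output.png")
```

Separating argument construction from the pipeline call keeps the settings easy to inspect and reuse across several edits of the same image.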

Application Scenarios for Qwen-Image-Edit

Creative Design: Quickly generate and modify virtual character appearances, outfits, and backgrounds, enabling diverse IP creation.
Advertising & Poster Design: Modify text directly in posters, including font, size, and color adjustments, without redesigning the poster from scratch, which saves time.
Film & Video Production: Efficiently adjust scene elements or character appearances in post-production, or transform video styles (e.g., from realistic to anime).
Education & Training: Easily generate or modify educational images and diagrams (e.g., historical portraits, scientific illustrations), enhancing teaching effectiveness.
Personal Use: Edit personal photos effortlessly, such as changing backgrounds, adding decorative elements, or modifying outfits to create customized images.