UniWorld V2 – an image editing model jointly developed by Tuzhan Intelligence and Peking University

AI Tools · by dongdong · updated 4 days ago

What is UniWorld V2?

UniWorld V2 is a next-generation image editing model jointly developed by Tuzhan Intelligence and Peking University’s UniWorld team. It adopts the UniWorld-R1 training framework and is the first to apply reinforcement learning–based policy optimization to image editing. Leveraging DiffusionNFT, the model trains efficiently; a multimodal large language model serves as the reward model, providing stable, fine-grained feedback, and a low-variance group filtering mechanism improves training stability.

UniWorld V2 can accurately understand and render complex Chinese typography, supports precise spatial control (such as editing only within a drawn bounding box), and maintains globally consistent lighting, producing natural, coherent results. It achieves leading performance on industry benchmarks such as GEdit-Bench and ImgEdit, surpassing all existing public models.


Main Features of UniWorld V2

Accurate Chinese Text Rendering:
Understands and generates complex artistic Chinese characters, such as “月满中秋” (“full moon at Mid-Autumn”), with clarity and semantic accuracy, enabling text modification through simple instructions.

Fine-grained Spatial Control:
Supports region-specific editing using bounding boxes. For example, with an instruction like “move the bird out of the red box,” the model can strictly follow spatial constraints to perform high-difficulty edits.
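To make the idea of a strict box constraint concrete, here is a minimal sketch (assuming NumPy image arrays; the function names are illustrative, not UniWorld V2’s published API) that composites an edited image back onto the original so that only pixels inside the box can change:

```python
import numpy as np

def box_mask(h, w, box):
    """Binary mask selecting the editable region; box = (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def composite(original, edited, mask):
    """Keep edits only inside the mask; pixels outside revert to the original."""
    return np.where(mask[..., None], edited, original)
```

In practice the model itself is trained to respect the region, but a hard composite like this is a common safety net for guaranteeing that out-of-box pixels are untouched.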

Global Lighting Fusion:
Accurately interprets lighting instructions—such as “relight the scene”—to ensure objects blend naturally into the environment with consistent lighting and shadows.

Instruction Alignment & Quality Enhancement:
Delivers strong instruction adherence and overall image quality; in human preference evaluations, its outputs were consistently favored, especially on tasks requiring precise instruction following.

Multi-model Compatibility:
The framework is model-agnostic and can be applied to various base models such as Qwen-Image-Edit and FLUX-Kontext, significantly boosting their performance.


Technical Principles of UniWorld V2

Innovative Training Framework:
Utilizes the UniWorld-R1 framework, pioneering the application of reinforcement learning–based policy optimization in image editing. With Diffusion Negative-aware Finetuning (DiffusionNFT), it achieves policy optimization without likelihood estimation, improving training efficiency.
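The published DiffusionNFT objective is likelihood-free and handles negatively-rewarded samples explicitly; the sketch below conveys only the underlying intuition of reward-driven policy optimization for a diffusion editor: a per-sample denoising error weighted by each sample’s reward advantage, so better-rewarded edits pull the model harder. All names are illustrative, and this is a simplification, not the actual DiffusionNFT loss.

```python
import numpy as np

def advantage_weighted_denoising_loss(pred, target, advantages):
    """Per-sample denoising (flow-matching-style) error, weighted by each
    sample's reward advantage. Positive-advantage samples are reinforced,
    negative-advantage samples are penalized. Intuition only; the real
    DiffusionNFT objective avoids likelihood estimation altogether."""
    per_sample = ((pred - target) ** 2).reshape(len(pred), -1).mean(axis=1)
    return float((advantages * per_sample).mean())
```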

Multimodal Reward Model:
Uses a multimodal large language model (MLLM) as a reward model, directly leveraging its logit outputs for fine-grained feedback while avoiding the computational cost and bias associated with complex chain-of-thought reasoning or sampling.
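One common recipe for turning an MLLM’s logits into a scalar reward is to ask the model a yes/no quality question about the edit and softmax the logits of the “yes” and “no” candidate tokens. A minimal sketch of that recipe (an assumption about the setup, not the confirmed UniWorld-R1 implementation):

```python
import math

def logit_reward(yes_logit, no_logit):
    """Reward = P('yes') under a two-way softmax over the MLLM's logits for
    the 'yes' and 'no' answer tokens. Shifting by the max keeps exp() stable."""
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)
```

Because the reward is read directly from the logits, no decoding or sampling is needed, which matches the efficiency argument above.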

Low-variance Group Filtering:
Addresses low-variance group issues in reward normalization by introducing a filtering strategy based on reward mean and variance to eliminate high-mean, low-variance sample groups, stabilizing the training process.
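A minimal sketch of such a filter (thresholds are illustrative, not the paper’s values): groups whose rewards are uniformly high produce near-zero advantages after mean-centering and mostly add noise, so they are dropped before computing group-relative advantages.

```python
import numpy as np

def filter_groups(group_rewards, mean_thresh=0.9, var_thresh=1e-3):
    """Drop high-mean, low-variance groups, then return mean-centered
    (group-relative) advantages for the groups that remain."""
    kept = []
    for rewards in group_rewards:
        r = np.asarray(rewards, dtype=float)
        if r.mean() >= mean_thresh and r.var() <= var_thresh:
            continue  # uniformly "good" group: near-zero advantages, skip it
        kept.append(r - r.mean())
    return kept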

Model-agnostic Design:
The framework is designed to be independent of any specific model and can be applied to multiple foundational image editing models such as Qwen-Image-Edit and FLUX-Kontext, making it widely adaptable.


Application Scenarios of UniWorld V2

Image Editing & Design:
Performs precise image edits based on user instructions—such as modifying text, adjusting object positions, or changing scene lighting—making it suitable for poster design, advertising creativity, and visual arts.

Content Creation & Generation:
Helps creators quickly produce image content that meets specific requirements, improving production efficiency in video creation, animation, and game development.

Product Display & Marketing:
Enhances product visuals through editing, such as adding effects, adjusting backgrounds, or optimizing lighting, ideal for e-commerce product showcases and brand marketing.

Education & Training:
Serves as a teaching tool for learning image editing techniques and can generate educational materials such as textbook illustrations and instructional slides.

Research & Experimentation:
Generates simulated image data to support research applications, such as producing controlled image samples for medical imaging, environmental science, and other fields.
