Skywork UniPic 2.0 – Kunlun Wanwei’s Open-Source Unified Multimodal Model

What is Skywork UniPic 2.0?

Skywork UniPic 2.0 is an efficient multimodal model open-sourced by Kunlun Wanwei that unifies image generation, image editing, and image understanding. It is built on the 2-billion-parameter SD3.5-Medium architecture and jointly optimizes generation and editing through three steps: pretraining, a progressive dual-task reinforcement learning strategy, and joint training. According to the project, it outperforms several models with considerably larger parameter counts. The model supports text-to-image generation, image editing, and multimodal understanding, and it is lightweight and can switch flexibly between these tasks, which helps developers build multimodal applications quickly.

Main Features of Skywork UniPic 2.0

Image Generation: Generates high-quality images based on user text descriptions, supporting various styles and scenarios.
Image Editing: Performs content modification, style transfer, and other edits on existing images to meet diverse editing needs.
Multimodal Understanding: Understands image content and answers related questions, supporting complex instruction execution and content modification.

Technical Principles of Skywork UniPic 2.0

Architecture Design: Based on the 2-billion-parameter SD3.5-Medium architecture, supporting both text-to-image generation and image editing. The image generation and editing module is frozen and combined with a multimodal model (such as Qwen2.5-VL-7B) through a connector, yielding a single integrated model for understanding, generation, and editing.
Pretraining: Pretrained on large-scale, high-quality image generation and editing datasets, giving the model its basic generation and editing abilities. A text encoder and a VAE encoder turn text and images into conditional inputs, which enhances multimodal understanding.
Reinforcement Learning: Employs a progressive dual-task reinforcement learning strategy based on the Flow-GRPO framework to separately optimize generation and editing tasks, avoiding interference between tasks and improving overall model performance.
Joint Training: First, a connector is pretrained to align the multimodal model with the image generation and editing module. Then, starting from the pretrained connector, the connector and the image generation and editing module are trained jointly to further improve performance.
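
The Reinforcement Learning item above refers to a GRPO-style method. The core idea of such methods is a group-relative advantage: several outputs are sampled for the same input, each is scored by a reward function, and the scores are normalized within that group. The following is a minimal sketch of that normalization step only, not the project's implementation. The function name, the reward values, the use of the population standard deviation, and the eps constant are all illustrative assumptions; the actual reward models and training loop are described in the technical paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same input.

    Hypothetical helper for illustration; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up reward scores for four images sampled from one text prompt.
generation_rewards = [0.2, 0.5, 0.9, 0.4]
# Made-up reward scores for four edits of one (image, instruction) pair.
editing_rewards = [0.7, 0.7, 0.7, 0.7]

# Each task normalizes over its own groups, so rewards from one task
# never shift the baseline of the other.
gen_adv = group_relative_advantages(generation_rewards)
edit_adv = group_relative_advantages(editing_rewards)
```

In this sketch the best-scoring sample in a group gets the largest positive advantage, and a group whose samples all score the same contributes no learning signal. Keeping generation and editing groups separate is one simple way to picture how the two tasks can be optimized without interfering with each other.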

Project Links for Skywork UniPic 2.0

Official Website: https://unipic-v2.github.io/
GitHub Repository: https://github.com/SkyworkAI/UniPic/tree/main/UniPic-2
HuggingFace Model Hub: https://huggingface.co/collections/Skywork/skywork-unipic2-6899b9e1b038b24674d996fd
Technical Paper: https://github.com/SkyworkAI/UniPic/blob/main/UniPic-2/assets/pdf/UNIPIC2.pdf

Application Scenarios of Skywork UniPic 2.0

Creative Design: Quickly generate advertisements, posters, or illustrations so designers can turn creative ideas into visuals.
Content Creation: Generate key frames, characters, or scenes for videos, animations, or game development to speed up the creative process.
Education: Generate related images or animations based on teaching content to assist instruction and enhance student engagement.
Entertainment: Produce personalized social media images or virtual reality scenes to enrich user experience.
Commercial Use: Generate product concept images, packaging designs, or marketing visuals to speed up commercial projects.