ShareGPT-4o-Image: Unlocking the Power of GPT-4o for Multimodal Image Generation and Editing

What is ShareGPT-4o-Image?

ShareGPT-4o-Image is a large-scale open-source multimodal image generation dataset released by the FreedomIntelligence team. It contains over 92,000 samples generated by the GPT-4o model, including text-to-image and text-plus-image-to-image samples. Based on this dataset, the team fine-tuned the Janus-4o model, a multimodal large language model (MLLM) that supports both text and image inputs, with powerful image generation and editing capabilities. The project aims to advance the field of multimodal AI by providing high-quality data and model support for image generation and interactive editing.

Key Features:

Large-Scale Multimodal Dataset
- Includes 92,256 samples: 45,717 for text-to-image generation, and 46,539 for text-and-image-to-image generation.
Janus-4o Model
- Fine-tuned on the ShareGPT-4o-Image dataset, supporting both text-to-image and text-plus-image-to-image generation tasks.
- Equipped with image editing abilities, allowing users to modify image details based on instructions.
Multimodal Input Fusion
- Combines text descriptions and existing image inputs for flexible and interactive image generation.

Technical Principles:

Data Collection and Construction
- Massive high-quality text-image pairs generated by the GPT-4o-Image model covering diverse scenarios and styles to ensure data richness and representativeness.
- Data is carefully filtered and standardized to ensure quality and effective model training.
Multimodal Model Architecture Design
- Built upon the Janus-Pro model architecture, integrating Transformer-based text and visual encoders to convert both text and image inputs into unified representations.
- The model can handle pure text input as well as combined text and image input, enabling diverse image generation and editing functionalities.
Fine-tuning and Training Strategy
- Janus-Pro is fine-tuned on the ShareGPT-4o-Image dataset using supervised learning to optimize performance on image generation tasks.
- Special training samples for image editing are incorporated, teaching the model to understand and respond to image modification instructions.
Multimodal Fusion and Generation Mechanism
- Uses cross-modal attention mechanisms internally to effectively fuse textual and visual information, generating images that align with semantic and visual requirements.
- Employs a decoder to progressively generate high-quality images, preserving detail and overall contextual consistency.
Image Editing Capability Implementation
- Training includes image editing-related tasks, enabling the model to recognize user editing requests (e.g., color adjustments, adding details).
- Supports interaction modes where new images are generated based on existing images, enhancing user experience and flexibility.

Project Links

GitHub Repository: https://github.com/FreedomIntelligence/ShareGPT-4o-Image
Hugging Face Dataset: https://huggingface.co/datasets/FreedomIntelligence/ShareGPT-4o-Image

Application Scenarios:

Multimodal AI Assistants
- Enables interactive image generation and editing through natural language and image inputs.
Creative Design & Content Generation
- Assists designers with rapid concept sketches and ideation during the creative process.
Educational Assistance
- Generates instructional illustrations and diagrams to support teaching visualization.
Digital Content Creation
- Provides rich image resources for articles, videos, and multimedia content creators.
Interactive Image Editing
- Allows users to edit existing images via simple instructions, suitable for online design tools and social media applications.