Insert Anything – An image insertion framework jointly launched by Zhejiang University, Harvard University and Nanyang Technological University
What is Insert Anything
Insert Anything is a context-aware image insertion framework developed jointly by researchers from Zhejiang University, Harvard University, and Nanyang Technological University. It enables seamless insertion of objects from reference images into target scenes, supporting a wide range of practical use cases such as artistic creation, realistic face replacement, movie scene composition, virtual try-on, accessory customization, and digital prop substitution. Trained on the AnyInsertion dataset containing 120K prompt-image pairs, Insert Anything can flexibly adapt to various insertion tasks, offering powerful support for creative content generation and virtual fitting applications.
Key Features of Insert Anything
- Multi-Scenario Support: Handles diverse image insertion tasks, including person insertion, object insertion, and clothing insertion.
- Flexible User Control: Offers both mask-guided and text-guided control modes; users can draw a mask or enter a text prompt to specify the insertion region and content (both modes are sketched after this list).
- High-Quality Output: Generates high-resolution, high-fidelity images while keeping the inserted elements consistent in detail and style with the target scene.
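To make the two control modes concrete, here is a minimal sketch of how a user might prepare each kind of input. The `insert_anything` entry point is hypothetical, invented for illustration; the repository's actual interface may differ, but the mask-vs-prompt distinction is the one the framework describes.

```python
# Illustrative sketch only: `insert_anything` is a hypothetical entry point,
# NOT the repository's actual API. It shows the two control modes the
# framework exposes: a user-drawn mask or a text prompt.
from PIL import Image

def make_box_mask(size, box):
    """Build a binary mask with a white rectangle marking the insertion region."""
    mask = Image.new("L", size, 0)
    mask.paste(255, box)  # fill the (x0, y0, x1, y1) box with white
    return mask

target = Image.open("scene.jpg").convert("RGB")
reference = Image.open("handbag.jpg").convert("RGB")

# Mask-guided mode: the mask pins down *where* the reference is inserted.
mask = make_box_mask(target.size, box=(120, 340, 360, 620))
# result = insert_anything(target, reference, mask=mask)      # hypothetical call

# Text-guided mode: a prompt describes the insertion instead of a mask.
prompt = "place the handbag on the wooden table, matching the warm lighting"
# result = insert_anything(target, reference, prompt=prompt)  # hypothetical call
```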
Technical Principles of Insert Anything
- AnyInsertion Dataset: The framework is trained on the large-scale AnyInsertion dataset of 120K prompt-image pairs, covering person, object, and clothing insertion tasks.
- Diffusion Transformer (DiT): A DiT-based multimodal attention mechanism processes textual and visual inputs together, jointly modeling relationships among text, masks, and image patches to enable flexible editing control (see the attention sketch after this list).
- Contextual Editing Mechanism: A polyptych format (a diptych for mask guidance, a triptych for text guidance) places reference and target images side by side, letting the model read contextual information and produce natural insertions (see the polyptych sketch below).
- Semantic Guidance: Image encoders (e.g., CLIP) and text encoders extract semantic information, ensuring that inserted elements align with the style and semantics of the target scene (see the encoder sketch below).
- Adaptive Cropping Strategy: For small targets, the crop region is dynamically enlarged so that sufficient contextual information and fine detail are preserved for high-quality results (see the cropping sketch below).
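The joint multimodal attention idea can be sketched in a few lines of PyTorch: text, mask, and image patch tokens are concatenated into one sequence so a single self-attention layer can relate all three. The dimensions and single attention layer below are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of joint multimodal attention in a DiT-style block.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text_tokens = torch.randn(1, 77, d_model)    # encoded prompt tokens
mask_tokens = torch.randn(1, 64, d_model)    # encoded mask patches
image_tokens = torch.randn(1, 256, d_model)  # noisy target-image patches

# Joint self-attention: every token attends to every other, letting image
# patches condition on both the prompt and the insertion mask.
seq = torch.cat([text_tokens, mask_tokens, image_tokens], dim=1)
out, _ = attn(seq, seq, seq)

# Only the image positions are carried forward to predict the denoised patches.
image_out = out[:, -256:, :]
```

The polyptych format itself is just an image-layout trick: reference and target are tiled onto one canvas so the model sees them in shared context. Panel size and ordering below are assumptions for illustration.

```python
# Sketch of the polyptych layout used for contextual editing.
from PIL import Image

def make_polyptych(panels, panel_size=(512, 512)):
    """Concatenate images horizontally into a single diptych/triptych canvas."""
    resized = [p.resize(panel_size) for p in panels]
    w, h = panel_size
    canvas = Image.new("RGB", (w * len(resized), h))
    for i, panel in enumerate(resized):
        canvas.paste(panel, (i * w, 0))
    return canvas

reference = Image.open("reference.jpg").convert("RGB")
target = Image.open("target.jpg").convert("RGB")

diptych = make_polyptych([reference, target])  # mask-guided editing
# Text guidance uses a triptych: a third panel would be appended the same way.
```

For the semantic-guidance step, a CLIP-style encoder turns the reference image and the prompt into embeddings that can condition generation. The sketch below uses Hugging Face's public CLIP checkpoint as a stand-in; the framework's actual encoders and checkpoints may differ.

```python
# Sketch of extracting semantic embeddings with CLIP (stand-in encoder).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

reference = Image.open("reference.jpg").convert("RGB")
inputs = processor(
    text=["a red leather handbag"], images=reference,
    return_tensors="pt", padding=True,
)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
# These embeddings condition generation so the inserted element
# matches the target scene's style and semantics.
```

Finally, an adaptive crop for small targets can be sketched as expanding the mask's bounding box by a context margin. The margin factor and minimum size here are assumed heuristics, not the paper's exact rule.

```python
# Sketch of adaptive cropping: grow the crop around a small mask so the
# model still sees enough surrounding context.
import numpy as np

def adaptive_crop(mask: np.ndarray, context: float = 2.0, min_size: int = 256):
    """Return (x0, y0, x1, y1) enclosing the mask plus surrounding context."""
    ys, xs = np.nonzero(mask)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    # Grow the box; small targets get at least `min_size` pixels of context.
    side = max(int(max(x1 - x0, y1 - y0) * context), min_size)
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    H, W = mask.shape
    nx0, ny0 = max(cx - side // 2, 0), max(cy - side // 2, 0)
    nx1, ny1 = min(nx0 + side, W), min(ny0 + side, H)
    return nx0, ny0, nx1, ny1

mask = np.zeros((1024, 1024), dtype=np.uint8)
mask[500:540, 700:760] = 1         # a small 60x40 target
print(adaptive_crop(mask))         # crop box with surrounding context
```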
Project Resources for Insert Anything
- Project Page: https://song-wensong.github.io/insert-anything/
- GitHub Repository: https://github.com/song-wensong/insert-anything
- arXiv Paper: https://arxiv.org/pdf/2504.15009
Application Scenarios for Insert Anything
- Artistic Creation: Rapidly combine different visual elements to spark creative ideas.
- Virtual Try-On: Let consumers preview clothing on themselves, improving the online shopping experience.
- Film and Visual Effects: Seamlessly insert virtual elements into live-action scenes, reducing production costs.
- Advertising Design: Quickly generate diverse ad creatives to boost appeal.
- Cultural Heritage Restoration: Virtually restore artifacts or architectural details for research and exhibition purposes.