BlenderFusion – A generative visual synthesis framework developed by Google DeepMind

What is BlenderFusion?

BlenderFusion is a generative visual synthesis framework developed by Google DeepMind that integrates traditional 3D editing software (Blender) with AI models to enable precise geometric manipulation and diverse visual composition. The framework operates in three stages: first, it extracts objects of interest from a source image and converts them into editable 3D elements (Object-centric Layering); second, it enables diverse object editing within Blender (Blender-grounded Editing); and finally, it seamlessly fuses the edited elements using a generative compositor to produce photorealistic final images (Generative Compositing). BlenderFusion excels at complex visual synthesis tasks, supporting flexible, decoupled, and 3D-aware control over objects, cameras, and backgrounds.

Key Features of BlenderFusion

Precise 3D Geometric Control: Enables detailed editing of objects in 3D using Blender, including position, rotation, scaling, and property adjustments such as color, material, and shape.
Flexible Camera Control: Supports independent camera manipulation, allowing for complex viewpoint changes without altering object positions.
Complex Scene Composition: Seamlessly integrates edited objects with backgrounds to generate realistic final images. Supports multi-object operations and complex scene editing.
Decoupled Object and Camera Manipulation: Allows users to modify objects while keeping the camera fixed, or to adjust the camera while keeping objects unchanged, so that object edits and camera edits stay independent of each other.
Generalization Capability: Can be applied to unseen objects and scenes, supporting both simple and complex editing tasks including progressive, multi-step edits.
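
The decoupling of object and camera control described above can be illustrated with plain transform matrices. The following is a minimal sketch, not BlenderFusion code: the helper names (`translation`, `project`), the pinhole camera looking down +Z, and the focal length and image center are all assumptions chosen for illustration. It shows that an object edit changes only the object-to-world matrix, while a camera edit changes only the world-to-camera matrix.

```python
import numpy as np

def translation(tx, ty, tz):
    """Return a 4x4 homogeneous translation matrix."""
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def project(points_obj, object_to_world, world_to_camera,
            focal=500.0, center=(320.0, 240.0)):
    """Project Nx3 object-space points to pixel coordinates.

    Assumes an idealized pinhole camera looking down +Z (hypothetical setup).
    """
    homo = np.hstack([points_obj, np.ones((len(points_obj), 1))])
    cam = (world_to_camera @ object_to_world @ homo.T).T[:, :3]
    u = focal * cam[:, 0] / cam[:, 2] + center[0]
    v = focal * cam[:, 1] / cam[:, 2] + center[1]
    return np.stack([u, v], axis=1)

# One point at the object's origin; the object sits 5 units in front of the camera.
point = np.array([[0.0, 0.0, 0.0]])
object_pose = translation(0, 0, 5)   # object-to-world
camera_pose = np.eye(4)              # world-to-camera

base = project(point, object_pose, camera_pose)

# Object edit: move the object 1 unit along X; the camera matrix is untouched.
object_edit = project(point, translation(1, 0, 0) @ object_pose, camera_pose)

# Camera edit: move the camera 1 unit along X (world-to-camera is the inverse
# motion); the object matrix is untouched.
camera_edit = project(point, object_pose, translation(-1, 0, 0))
```

Because the two matrices are separate inputs, either one can be edited without touching the other, which is the property the feature list calls decoupled control.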

Technical Foundations of BlenderFusion

Object-centric Layering: Uses visual foundation models (such as SAM2 for segmentation and Depth Pro for depth estimation) to extract objects from input images and convert them into editable 3D elements. Optionally integrates image-to-3D models (such as Rodin or Hunyuan3D) to generate full 3D meshes, which are aligned with the 2.5D surface meshes to allow flexible test-time edits.
Blender-grounded Editing: Imports the layered 3D objects into Blender to perform a wide range of editing operations, including basic transformations, property changes, and non-rigid deformations. Also supports camera manipulation and background replacement, offering precise 3D control signals for the compositing stage.
Generative Compositing: A diffusion-based generative compositor fuses Blender renderings with backgrounds to create the final realistic image. The compositor features a dual-stream architecture that processes both the original and target scenes, using cross-view attention mechanisms to merge their information. Training techniques such as source masking and simulated object jittering enhance flexibility and decoupling in complex editing tasks.
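
To make the layering step more concrete, the sketch below lifts the pixels of one segmented object into 3D points using a mask and a depth map. It is an illustration under stated assumptions rather than the framework's implementation: the function name `lift_object_to_points` is hypothetical, the camera is an idealized pinhole with known intrinsics, and small synthetic arrays stand in for what a segmentation model and a depth estimator would produce.

```python
import numpy as np

def lift_object_to_points(mask, depth, focal, cx, cy):
    """Back-project masked pixels into camera-space 3D points.

    mask  : HxW boolean array marking the object's pixels
    depth : HxW array of depth values along the camera Z axis
    focal, cx, cy : assumed pinhole intrinsics (in pixels)
    """
    v, u = np.nonzero(mask)        # row (y) and column (x) indices of object pixels
    z = depth[v, u]
    x = (u - cx) * z / focal
    y = (v - cy) * z / focal
    return np.stack([x, y, z], axis=1)

# Synthetic stand-ins for a segmentation mask and a depth map.
height, width = 4, 6
mask = np.zeros((height, width), dtype=bool)
mask[1:3, 2:4] = True                       # a 2x2 object region
depth = np.full((height, width), 2.0)       # flat surface 2 units away

points = lift_object_to_points(mask, depth, focal=100.0,
                               cx=width / 2, cy=height / 2)
```

Connecting neighboring points into triangles would turn such a point set into a 2.5D surface mesh, which could then be imported into a 3D editor. How BlenderFusion builds and aligns its meshes in detail is not specified here.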

Project Links

Official Website: https://blenderfusion.github.io/
Technical Paper on arXiv: https://arxiv.org/pdf/2506.17450

Application Scenarios of BlenderFusion

Film and TV Production: Used in visual effects (VFX) for movies and shows to add virtual objects, adjust scene layouts, or change backgrounds and produce realistic composite scenes.
Game Development: Helps developers design and edit game environments quickly by adding and modifying objects, adjusting camera angles, and generating lifelike in-game visuals.
Advertising: Helps ad designers create high-quality product visuals that highlight key features and appeal to audiences.
Architectural Visualization: Allows architects and interior designers to visualize indoor layouts by adding or modifying furniture and decorations and generating photorealistic interior renderings.
Artistic Creation: Empowers artists to produce unique digital artworks using 3D editing and generative composition, enabling creative visual storytelling.