SceneGen – 3D Scene Generation Framework Developed by Shanghai Jiao Tong University
What is SceneGen?
SceneGen is an efficient open-source 3D scene generation framework developed by a research team at Shanghai Jiao Tong University. Starting from a single scene image and its corresponding object segmentation mask, SceneGen can directly generate a complete 3D scene—including geometry, texture, and spatial layout—through a single forward pass. Its innovation lies in an end-to-end generation pipeline that eliminates the need for time-consuming optimization or asset retrieval and assembly, greatly improving generation efficiency.
At its core, SceneGen integrates local and global scene aggregation modules and introduces a position prediction head that simultaneously predicts 3D assets and their relative spatial positions, ensuring both physical plausibility and visual consistency. The framework is designed for applications in VR/AR, embodied AI, game development, and interior design, offering a powerful solution for rapidly constructing realistic virtual environments.
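To make the position-head idea concrete, here is a minimal PyTorch sketch. Everything in it (the module layout, feature dimensions, mean-pooling, and a 9-DoF pose parameterization) is our illustrative assumption, not the paper's published design:

```python
import torch
import torch.nn as nn

class PositionHead(nn.Module):
    """Minimal sketch of a position prediction head (assumed design).

    Pools an object's scene-aware tokens into one vector and regresses
    a pose. The 9-DoF output (translation, scale, rotation) is an
    assumed parameterization, not necessarily SceneGen's exact one.
    """
    def __init__(self, feat_dim: int = 768, pose_dim: int = 9):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, pose_dim),
        )

    def forward(self, obj_tokens: torch.Tensor) -> torch.Tensor:
        # obj_tokens: (num_objects, tokens_per_object, feat_dim)
        pooled = obj_tokens.mean(dim=1)   # one feature vector per object
        return self.mlp(pooled)           # (num_objects, pose_dim)

poses = PositionHead()(torch.randn(4, 16, 768))  # 4 objects -> (4, 9) poses
```

Predicting poses from the same scene-aware features used to generate the assets is what allows geometry and layout to come out of a single forward pass.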
Key Features of SceneGen
- **Single-Image-to-3D Scene Generation:** Generates a complete 3D scene (geometry, texture, and spatial layout) from a single scene image and its segmentation mask.
- **Efficient End-to-End Generation:** Produces full 3D scenes in one forward pass, without iterative optimization or asset retrieval, significantly boosting efficiency.
- **Local and Global Information Aggregation:** Incorporates aggregation modules during feature extraction to effectively combine local details with global context, ensuring realistic and consistent scene generation (a rough sketch of this wiring follows the list).
- **Joint Asset and Position Prediction:** Uses a dedicated position head to jointly predict both 3D assets and their precise spatial positions within the scene.
- **High Accuracy and Realism:** Outperforms prior methods in geometric precision, texture fidelity, and overall visual quality on both synthetic and real-world datasets.
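The local/global aggregation described above could be wired up as below. This is a rough sketch under our own assumptions (per-object self-attention plus cross-attention to scene-level tokens, with made-up dimensions), not SceneGen's released modules:

```python
import torch
import torch.nn as nn

class SceneAggregator(nn.Module):
    """Rough sketch of local + global aggregation (assumed wiring).

    Local attention refines each object's own tokens; global attention
    lets object tokens attend to scene-level tokens for context.
    """
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, obj_tokens, scene_tokens):
        # obj_tokens:   (num_objects, tokens_per_object, dim)
        # scene_tokens: (1, scene_len, dim), shared by every object
        x = self.norm1(obj_tokens)
        local, _ = self.local_attn(x, x, x)        # intra-object detail
        obj_tokens = obj_tokens + local
        ctx = scene_tokens.expand(obj_tokens.size(0), -1, -1)
        x = self.norm2(obj_tokens)
        fused, _ = self.global_attn(x, ctx, ctx)   # object <- scene context
        return obj_tokens + fused

out = SceneAggregator()(torch.randn(4, 16, 768), torch.randn(1, 64, 768))
print(out.shape)  # torch.Size([4, 16, 768])
```

The residual form keeps each object's local features intact while mixing in scene context, matching the stated goal of combining local detail with global consistency.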
Technical Overview
- **Input Processing and Feature Extraction:** Takes a single scene image and its object segmentation mask as input; visual and geometric encoders extract object-level and global scene-level features, respectively.
- **Local Texture Refinement:** A pretrained local attention module refines object texture details to ensure visual realism.
- **Global Feature Fusion:** A global attention (aggregation) module fuses object-level and scene-level information, capturing spatial relationships and contextual dependencies between objects.
- **Joint Decoding and Generation:** A structure decoder processes the fused features while the position head predicts the relative spatial positions of assets, generating geometry, texture, and layout simultaneously (a combined end-to-end sketch follows this list).
- **Single-Pass Inference:** The entire process completes in one forward pass, with no iterative optimization or external asset retrieval, achieving high efficiency on both synthetic and real-world datasets.
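Putting the steps together, here is a compact, self-contained sketch of how such a single-pass pipeline could look in PyTorch. The module choices (a stock cross-attention layer for fusion, a transformer decoder layer as the structure decoder, a linear 9-DoF pose head, and all dimensions) are our assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class SceneGenSketch(nn.Module):
    """One-forward-pass pipeline sketch (assumed structure)."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.struct_dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.pos_head = nn.Linear(dim, 9)  # assumed 9-DoF pose (t, s, r)

    def forward(self, obj_tokens, scene_tokens, asset_queries):
        # obj_tokens:    (num_objects, tokens, dim) from a visual encoder
        # scene_tokens:  (1, scene_len, dim) global context for all objects
        # asset_queries: (num_objects, q, dim) latents decoded into assets
        ctx = scene_tokens.expand(obj_tokens.size(0), -1, -1)
        fused, _ = self.fuse(obj_tokens, ctx, ctx)      # global aggregation
        fused = obj_tokens + fused
        assets = self.struct_dec(asset_queries, fused)  # asset latents
        poses = self.pos_head(fused.mean(dim=1))        # pose per object
        return assets, poses

model = SceneGenSketch()
assets, poses = model(torch.randn(4, 16, 768),   # 4 objects' tokens
                      torch.randn(1, 64, 768),   # scene-level tokens
                      torch.randn(4, 32, 768))   # learned asset queries
print(assets.shape, poses.shape)  # (4, 32, 768) and (4, 9)
```

In the real system the asset latents would go on to a 3D generator that produces textured geometry; the point here is only that assets and poses fall out of one forward call, with no optimization loop.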
Project Links
- Official Website: https://mengmouxu.github.io/SceneGen/
- GitHub Repository: https://github.com/mengmouxu/scenegen
- Hugging Face Model Hub: https://huggingface.co/haoningwu/scenegen
- arXiv Paper: https://arxiv.org/pdf/2508.15769
Application Scenarios
- **Game and Film Production:** Rapidly generates production-ready 3D environments from concept art or reference photos, reducing scene-modeling time; particularly valuable for indie developers and small studios.
- **Virtual and Augmented Reality (VR/AR):** Efficiently creates realistic, interactive 3D worlds for VR/AR and embodied-AI applications, where large-scale, high-fidelity environments are essential.
- **Real Estate and Interior Design:** Converts 2D floor plans or real-world photos into interactive 3D walkthroughs, helping developers, agents, and clients visualize spatial layouts and design aesthetics.
- **Simulation and Training Environments:** Provides efficient scene generation for applications such as autonomous driving and robot navigation that require large quantities of realistic virtual training environments.