SceneGen – 3D Scene Generation Framework Developed by Shanghai Jiao Tong University
What is SceneGen?
SceneGen is an efficient open-source 3D scene generation framework developed by a research team at Shanghai Jiao Tong University. Starting from a single scene image and its corresponding object segmentation mask, SceneGen can directly generate a complete 3D scene—including geometry, texture, and spatial layout—through a single forward pass. Its innovation lies in an end-to-end generation pipeline that eliminates the need for time-consuming optimization or asset retrieval and assembly, greatly improving generation efficiency.
At its core, SceneGen integrates local and global scene aggregation modules and introduces a position prediction head that simultaneously predicts 3D assets and their relative spatial positions, ensuring both physical plausibility and visual consistency. The framework is designed for applications in VR/AR, embodied AI, game development, and interior design, offering a powerful solution for rapidly constructing realistic virtual environments.
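To make the position-head idea concrete, here is a minimal PyTorch sketch. Everything in it (the module layout, feature dimensions, mean-pooling, and a 9-DoF pose parameterization) is our illustrative assumption, not the paper's published design:

```python
import torch
import torch.nn as nn

class PositionHead(nn.Module):
    """Minimal sketch of a position prediction head (assumed design).

    Pools an object's scene-aware tokens into one vector and regresses
    a pose. The 9-DoF output (translation, scale, rotation) is an
    assumed parameterization, not necessarily SceneGen's exact one.
    """
    def __init__(self, feat_dim: int = 768, pose_dim: int = 9):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, pose_dim),
        )

    def forward(self, obj_tokens: torch.Tensor) -> torch.Tensor:
        # obj_tokens: (num_objects, tokens_per_object, feat_dim)
        pooled = obj_tokens.mean(dim=1)   # one feature vector per object
        return self.mlp(pooled)           # (num_objects, pose_dim)

poses = PositionHead()(torch.randn(4, 16, 768))  # 4 objects -> (4, 9) poses
```

Predicting poses from the same scene-aware features used to generate the assets is what allows geometry and layout to come out of a single forward pass.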
Key Features of SceneGen
- **Single-Image-to-3D Scene Generation:** Generates a complete 3D scene (geometry, texture, and spatial layout) from a single scene image and its segmentation mask.
- **Efficient End-to-End Generation:** Produces full 3D scenes in one forward pass, without iterative optimization or asset retrieval, significantly boosting efficiency.
- **Local and Global Information Aggregation:** Incorporates aggregation modules during feature extraction to effectively combine local details with global context, ensuring realistic and consistent scene generation (a rough sketch of this wiring follows the list).
- **Joint Asset and Position Prediction:** Uses a dedicated position head to jointly predict both 3D assets and their precise spatial positions within the scene.
- **High Accuracy and Realism:** Outperforms prior methods in geometric precision, texture fidelity, and overall visual quality on both synthetic and real-world datasets.
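The local/global aggregation described above could be wired up as below. This is a rough sketch under our own assumptions (per-object self-attention plus cross-attention to scene-level tokens, with made-up dimensions), not SceneGen's released modules:

```python
import torch
import torch.nn as nn

class SceneAggregator(nn.Module):
    """Rough sketch of local + global aggregation (assumed wiring).

    Local attention refines each object's own tokens; global attention
    lets object tokens attend to scene-level tokens for context.
    """
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, obj_tokens, scene_tokens):
        # obj_tokens:   (num_objects, tokens_per_object, dim)
        # scene_tokens: (1, scene_len, dim), shared by every object
        x = self.norm1(obj_tokens)
        local, _ = self.local_attn(x, x, x)        # intra-object detail
        obj_tokens = obj_tokens + local
        ctx = scene_tokens.expand(obj_tokens.size(0), -1, -1)
        x = self.norm2(obj_tokens)
        fused, _ = self.global_attn(x, ctx, ctx)   # object <- scene context
        return obj_tokens + fused

out = SceneAggregator()(torch.randn(4, 16, 768), torch.randn(1, 64, 768))
print(out.shape)  # torch.Size([4, 16, 768])
```

The residual form keeps each object's local features intact while mixing in scene context, matching the stated goal of combining local detail with global consistency.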
Technical Overview
- **Input Processing and Feature Extraction:** Takes a single scene image and its object segmentation mask as input; visual and geometric encoders extract object-level and global scene-level features, respectively.
- **Local Texture Refinement:** A pretrained local attention module refines object texture details to ensure visual realism.
- **Global Feature Fusion:** A global attention (aggregation) module fuses object-level and scene-level information, capturing spatial relationships and contextual dependencies between objects.
- **Joint Decoding and Generation:** A structure decoder processes the fused features while the position head predicts the relative spatial positions of assets, generating geometry, texture, and layout simultaneously (a combined end-to-end sketch follows this list).
- **Single-Pass Inference:** The entire process completes in one forward pass, with no iterative optimization or external asset retrieval, achieving high efficiency on both synthetic and real-world datasets.
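Putting the steps together, here is a compact, self-contained sketch of how such a single-pass pipeline could look in PyTorch. The module choices (a stock cross-attention layer for fusion, a transformer decoder layer as the structure decoder, a linear 9-DoF pose head, and all dimensions) are our assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class SceneGenSketch(nn.Module):
    """One-forward-pass pipeline sketch (assumed structure)."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.struct_dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.pos_head = nn.Linear(dim, 9)  # assumed 9-DoF pose (t, s, r)

    def forward(self, obj_tokens, scene_tokens, asset_queries):
        # obj_tokens:    (num_objects, tokens, dim) from a visual encoder
        # scene_tokens:  (1, scene_len, dim) global context for all objects
        # asset_queries: (num_objects, q, dim) latents decoded into assets
        ctx = scene_tokens.expand(obj_tokens.size(0), -1, -1)
        fused, _ = self.fuse(obj_tokens, ctx, ctx)      # global aggregation
        fused = obj_tokens + fused
        assets = self.struct_dec(asset_queries, fused)  # asset latents
        poses = self.pos_head(fused.mean(dim=1))        # pose per object
        return assets, poses

model = SceneGenSketch()
assets, poses = model(torch.randn(4, 16, 768),   # 4 objects' tokens
                      torch.randn(1, 64, 768),   # scene-level tokens
                      torch.randn(4, 32, 768))   # learned asset queries
print(assets.shape, poses.shape)  # (4, 32, 768) and (4, 9)
```

In the real system the asset latents would go on to a 3D generator that produces textured geometry; the point here is only that assets and poses fall out of one forward call, with no optimization loop.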
Project Links
- Official Website: https://mengmouxu.github.io/SceneGen/
- GitHub Repository: https://github.com/mengmouxu/scenegen
- Hugging Face Model Hub: https://huggingface.co/haoningwu/scenegen
- arXiv Paper: https://arxiv.org/pdf/2508.15769
Application Scenarios
- **Game and Film Production:** Rapidly generates production-ready 3D environments from concept art or reference photos, reducing scene-modeling time; particularly valuable for indie developers and small studios.
- **Virtual and Augmented Reality (VR/AR):** Efficiently creates realistic, interactive 3D worlds for VR/AR and embodied-AI applications, where large-scale, high-fidelity environments are essential.
- **Real Estate and Interior Design:** Converts 2D floor plans or real-world photos into interactive 3D walkthroughs, helping developers, agents, and clients visualize spatial layouts and design aesthetics.
- **Simulation and Training Environments:** Provides efficient scene generation for applications such as autonomous driving and robot navigation that require large quantities of realistic virtual training environments.