Direct3D-S2: A High-Resolution 3D Generation Framework Jointly Launched by Nanjing University, Fudan University, and Other Universities


What is Direct3D-S2?

Direct3D-S2 is a high-resolution 3D generation framework jointly developed by researchers from Nanjing University, DreamTech, Fudan University, and the University of Oxford. Built on a sparse volumetric representation and an innovative Spatial Sparse Attention (SSA) mechanism, it significantly improves the computational efficiency of diffusion transformers (DiT) and reduces training costs. The framework features an end-to-end Sparse SDF Variational Autoencoder (SS-VAE) with a symmetric encoder-decoder architecture that supports multi-resolution training, enabling training at 1024³ resolution with just 8 GPUs. Direct3D-S2 outperforms existing methods in both generation quality and efficiency, offering powerful support for high-resolution 3D content creation.



Key Features of Direct3D-S2

  • High-resolution 3D shape generation: Generates high-resolution 3D shapes (up to 1024³) from images, delivering detailed geometry and high visual fidelity.

  • Efficient training and inference: Greatly enhances the computational efficiency of diffusion transformers while reducing training costs—training at 1024³ resolution requires only 8 GPUs.

  • Image-conditioned 3D generation: Capable of generating 3D models conditioned on input images, ensuring high correspondence between the 2D and 3D domains.


Technical Highlights of Direct3D-S2

  • Spatial Sparse Attention (SSA) Mechanism:

    • Divides input tokens into blocks based on their 3D spatial coordinates and uses sparse 3D convolutions and pooling to extract block-level global information, reducing the token count and increasing computational efficiency.

    • Selects important blocks for fine-grained feature extraction based on the attention scores from the compression module, concentrating computation where it matters most.

    • Applies local window operations to inject local features, strengthening interactions among neighboring tokens and improving generation quality.

    • A gating mechanism predicts per-token weights that aggregate the outputs of all three branches into the final attention result (a toy sketch of the branches and the gate follows this list).

  • Sparse SDF Variational Autoencoder (SS-VAE):

    • Combines sparse 3D convolutional networks and Transformer architectures to encode high-resolution sparse SDF volumes into latent sparse representations.

    • The decoder mirrors the encoder to reconstruct the SDF volume; during training, resolutions are sampled at random to improve adaptability, training efficiency, and generalization (sketched in the second example after this list).

  • Image-conditioned Diffusion Transformer (SS-DiT):

    • Extracts sparse foreground tokens from input images to minimize background interference, ensuring higher consistency between the input and generated 3D models.

    • Trained with Conditional Flow Matching (CFM) to predict the velocity field that transports noise to the data distribution, enabling efficient 3D shape generation (the third sketch after this list illustrates the objective).
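
For readers who want a concrete picture of how the three SSA branches fit together, here is a toy, single-head PyTorch sketch. The dense tensors, the block-hashing scheme, mean pooling, the window size, and the random stand-in for the learned gate are all illustrative assumptions; the actual implementation operates on sparse voxel tokens with sparse 3D convolutions and optimized attention kernels.

```python
import torch

def ssa_sketch(q, k, v, coords, block_size=4, top_k=2, window=4):
    """q, k, v: (N, d) token features; coords: (N, 3) integer voxel coordinates."""
    N, d = q.shape
    scale = d ** -0.5

    # 1) Partition tokens into spatial blocks by quantizing their coordinates.
    bid = coords // block_size                              # (N, 3) block index
    key = bid[:, 0] * 10_000 + bid[:, 1] * 100 + bid[:, 2]  # hash to a scalar id
    _, inv = torch.unique(key, return_inverse=True)         # inv[i] = block of token i
    B = int(inv.max()) + 1
    top_k = min(top_k, B)
    counts = torch.bincount(inv).float().unsqueeze(1)       # tokens per block

    # 2) Compression branch: pool each block into one summary token and attend
    #    over the B summaries (the paper uses sparse 3D conv + pooling here).
    k_blk = torch.zeros(B, d).index_add_(0, inv, k) / counts
    v_blk = torch.zeros(B, d).index_add_(0, inv, v) / counts
    blk_attn = torch.softmax(q @ k_blk.T * scale, dim=-1)   # (N, B)
    out_cmp = blk_attn @ v_blk

    # 3) Selection branch: reuse the compressed attention scores to keep only
    #    the top-k blocks per query, then attend over the tokens inside them.
    sel = blk_attn.topk(top_k, dim=-1).indices              # (N, top_k)
    keep = (inv.view(1, 1, N) == sel.unsqueeze(-1)).any(1)  # (N, N) token mask
    scores = (q @ k.T * scale).masked_fill(~keep, float('-inf'))
    out_sel = torch.softmax(scores, dim=-1) @ v

    # 4) Window branch: attend only to tokens within a local spatial window.
    dist = (coords.unsqueeze(0) - coords.unsqueeze(1)).abs().amax(-1)
    scores_w = (q @ k.T * scale).masked_fill(dist > window, float('-inf'))
    out_win = torch.softmax(scores_w, dim=-1) @ v

    # 5) Gate: per-token weights mix the three branches (learned in the real
    #    model; a random projection stands in here).
    gate = torch.softmax(q @ torch.randn(d, 3), dim=-1)     # (N, 3)
    return gate[:, 0:1] * out_cmp + gate[:, 1:2] * out_sel + gate[:, 2:3] * out_win

coords = torch.randint(0, 16, (64, 3))                      # 64 occupied voxels
q, k, v = torch.randn(3, 64, 32).unbind(0)
print(ssa_sketch(q, k, v, coords).shape)                    # torch.Size([64, 32])
```

The design intuition: the compression branch gives every query a cheap global view, the selection branch spends full-resolution attention only where the compressed scores say it is worthwhile, and the window branch keeps fine local detail; the gate lets the model decide per token how to weight the three views.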
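The multi-resolution training idea behind SS-VAE can likewise be sketched with a toy dense 3D-convolution VAE that re-samples the SDF resolution every step. The layer sizes, the resolution set {16, 32, 64}, and the loss weighting are placeholders; the real SS-VAE couples sparse 3D convolutions with Transformer blocks and scales to 1024³.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySdfVAE(nn.Module):
    """Symmetric encoder-decoder; fully convolutional, so any resolution
    divisible by 4 passes through with matching input/output shape."""
    def __init__(self, ch=8, z_ch=4):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(1, ch, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(ch, 2 * z_ch, 4, stride=2, padding=1),   # -> (mu, logvar)
        )
        self.dec = nn.Sequential(                              # mirror of enc
            nn.ConvTranspose3d(z_ch, ch, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, sdf):
        mu, logvar = self.enc(sdf).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        return self.dec(z), mu, logvar

vae = ToySdfVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-4)
for step in range(3):
    res = random.choice([16, 32, 64])            # re-sample the resolution
    sdf = torch.randn(1, 1, res, res, res)       # placeholder SDF volume
    recon, mu, logvar = vae(sdf)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
    loss = F.mse_loss(recon, sdf) + 1e-3 * kl
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step {step}: res={res}, loss={loss.item():.4f}")
```

Because the network is fully convolutional, nothing changes between resolutions except the cost of the forward pass, which is what makes mixing resolutions during training practical.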
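Finally, the Conditional Flow Matching objective reduces to a few lines: interpolate linearly between noise and clean latents, and regress the constant velocity of that path. The tiny MLP and the conditioning-by-concatenation below are stand-ins for the SS-DiT and its image-token conditioning.

```python
import torch
import torch.nn as nn

latent_dim, cond_dim = 32, 16
net = nn.Sequential(                              # stand-in for the SS-DiT
    nn.Linear(latent_dim + cond_dim + 1, 128), nn.SiLU(),
    nn.Linear(128, latent_dim),
)

def cfm_loss(x1, cond):
    """x1: (B, latent_dim) clean latents; cond: (B, cond_dim) image tokens."""
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(x1.size(0), 1)                 # time in [0, 1]
    xt = (1 - t) * x0 + t * x1                    # point on the straight path
    v_target = x1 - x0                            # its (constant) velocity
    v_pred = net(torch.cat([xt, cond, t], dim=-1))
    return (v_pred - v_target).pow(2).mean()

loss = cfm_loss(torch.randn(4, latent_dim), torch.randn(4, cond_dim))
loss.backward()
print(loss.item())
```

At inference time, generation amounts to integrating the predicted velocity field from a noise sample toward the data distribution, typically with far fewer steps than classic diffusion sampling.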


Application Scenarios

  • Virtual Reality (VR) and Augmented Reality (AR):
    Build lifelike 3D environments and personalized avatars, and blend generated assets with real-world scenes for education and cultural heritage preservation.

  • Game Development:
    Rapidly generate high-quality 3D game assets and real-time 3D content based on user interaction or game logic.

  • Product Design and Prototyping:
    Quickly generate 3D product models for virtual displays and meet customization needs in industrial design.

  • Film and Animation Production:
    Generate high-fidelity 3D characters, create virtual scenes, and produce complex 3D visual effects.

  • Education and Training:
    Develop virtual labs, 3D educational models, and immersive environments for vocational training and skill development.
