TripoSF: A new-generation 3D foundation model launched by VAST AI

What is TripoSF?

TripoSF is a new-generation 3D foundation model from VAST that targets three long-standing bottlenecks of 3D modeling: fine detail, complex structures, and scalability. It is built on the SparseFlex representation, which uses a sparse voxel structure to store and compute voxel information only in regions near the object’s surface. This significantly reduces memory usage and makes high-resolution training and inference practical. TripoSF also introduces a “frustum-aware partitioned voxel training” strategy that further reduces training cost. In the reported experiments across multiple benchmarks, TripoSF reduces Chamfer Distance by approximately 82% and improves F-score by approximately 88% relative to previous methods.

The main features of TripoSF

Detail Capture Capability: Traditional 3D modeling methods often struggle to capture fine detail. TripoSF can capture fine surface details and small-scale structures. Across multiple standard benchmarks, it is reported to achieve an approximately 82% reduction in Chamfer Distance and an approximately 88% improvement in F-score.
Topological Structure Support: TripoSF natively supports arbitrary topology, enabling natural representation of open surfaces and internal structures. This gives TripoSF a clear advantage when handling complex structures such as fabrics and blades.
Computational Resource Requirements: By leveraging sparse voxel structures, TripoSF significantly reduces memory usage. This makes TripoSF more efficient in high-resolution modeling and less demanding on computational resources.
Rendering-Based Training: TripoSF’s frustum-aware training strategy borrows frustum culling from real-time rendering, so each training step processes only the voxels that fall inside the camera’s view. TripoSF can also be trained end to end with a rendering loss, which avoids the detail degradation caused by data transformations (e.g., watertighting).
High-Resolution Modeling: TripoSF can perform training and inference at a high resolution of 1024³, enabling the generation of more detailed and realistic 3D models.
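
To put the memory argument in concrete terms: a dense 1024³ grid holds 1,073,741,824 voxels, so a single 32-bit value per voxel already takes 4 GiB. The sketch below is an illustration only, not TripoSF code. The sphere used as a stand-in object, the `band` width, and the function names are all assumptions chosen for the example. It counts the voxels that lie near a surface and compares that count with the full dense grid.

```python
def count_near_surface_voxels(resolution, radius=0.4, band=1.0):
    """Count voxels whose centre is within `band` voxel widths of a sphere surface.

    The cube [-0.5, 0.5]^3 is split into resolution^3 voxels. The sphere is a
    stand-in for an arbitrary object surface (an assumption of this sketch).
    """
    voxel = 1.0 / resolution
    centres = [(i + 0.5) * voxel - 0.5 for i in range(resolution)]
    count = 0
    for x in centres:
        for y in centres:
            for z in centres:
                dist = (x * x + y * y + z * z) ** 0.5
                if abs(dist - radius) <= band * voxel:
                    count += 1
    return count


def near_surface_fraction(resolution):
    """Fraction of the dense grid that a surface-only representation would keep."""
    return count_near_surface_voxels(resolution) / resolution ** 3


# Dense storage grows with N^3: one float32 per voxel at 1024^3 is 4 GiB.
dense_1024_bytes = 1024 ** 3 * 4
print(f"dense 1024^3 float32 grid: {dense_1024_bytes / 2**30:.0f} GiB")

# The near-surface voxel count grows roughly with N^2, so its share of the
# dense grid shrinks as the resolution increases.
for n in (16, 32, 64):
    print(n, n ** 3, count_near_surface_voxels(n), f"{near_surface_fraction(n):.1%}")
```

Because the surface is two-dimensional, the kept fraction keeps falling as the resolution rises, which is why a surface-only layout becomes more attractive at high resolutions.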

The technical principles of TripoSF

The SparseFlex Representation Method: The core of TripoSF is the SparseFlex representation, which builds on the strengths of NVIDIA Flexicubes and adds a sparse voxel structure. Unlike a traditional dense grid, the sparse voxel structure stores and computes voxel data only in regions near the object’s surface, which significantly reduces memory usage. This lets TripoSF train and run inference at resolutions as high as 1024³ while natively supporting arbitrary topology.
Frustum-Aware Partitioned Voxel Training Strategy: This strategy borrows the concept of frustum culling from real-time rendering. During each training iteration, only the SparseFlex voxels located within the camera’s frustum are activated and processed. This targeted activation significantly reduces training overhead, making efficient high-resolution training possible.
TripoSF Variational Autoencoder (VAE): Building on the SparseFlex representation and the efficient training strategy, VAST constructs the TripoSF VAE, a complete and efficient pipeline that covers input, encoding, decoding, and output. The VAE is the basis for TripoSF’s reconstruction and generation capabilities.
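
The first two ideas above, surface-only voxel storage and frustum-based activation, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the TripoSF implementation: the sphere object, the axis-aligned camera looking down +z, the square frustum, and every function name are hypothetical. Each voxel here stores only a signed distance value, whereas the real representation stores more per voxel.

```python
import math


def voxel_centre(index, resolution):
    """World-space centre of a voxel in the cube [-0.5, 0.5]^3."""
    return tuple((c + 0.5) / resolution - 0.5 for c in index)


def build_sparse_shell(resolution, radius=0.4, band=1.0):
    """Map (i, j, k) -> signed distance, only for voxels near a sphere surface."""
    voxel = 1.0 / resolution
    sparse = {}
    for i in range(resolution):
        for j in range(resolution):
            for k in range(resolution):
                x, y, z = voxel_centre((i, j, k), resolution)
                sdf = math.sqrt(x * x + y * y + z * z) - radius
                if abs(sdf) <= band * voxel:
                    sparse[(i, j, k)] = sdf
    return sparse


def in_frustum(point, cam_pos, fov_deg=20.0, near=0.1, far=10.0):
    """True if `point` is inside a square frustum looking down +z from cam_pos."""
    dx, dy, depth = (p - c for p, c in zip(point, cam_pos))
    if not near <= depth <= far:
        return False
    half_extent = depth * math.tan(math.radians(fov_deg) / 2.0)
    return abs(dx) <= half_extent and abs(dy) <= half_extent


def active_voxels(sparse, resolution, cam_pos):
    """Keep only the sparse voxels visible from the camera for this iteration."""
    return {
        idx: value
        for idx, value in sparse.items()
        if in_frustum(voxel_centre(idx, resolution), cam_pos)
    }


resolution = 32
sparse = build_sparse_shell(resolution)
active = active_voxels(sparse, resolution, cam_pos=(0.0, 0.0, -1.5))
print(f"dense: {resolution ** 3}, sparse: {len(sparse)}, in frustum: {len(active)}")
```

In a training loop, a different camera would be sampled at each iteration, so every voxel is eventually visited while any single step touches only a subset of the already sparse set.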

Project links for TripoSF

Project official website: https://xianglonghe.github.io/TripoSF/
GitHub repository: https://github.com/VAST-AI-Research/TripoSF
Hugging Face model hub: https://huggingface.co/VAST-AI/TripoSF
arXiv technical paper: https://arxiv.org/pdf/2503.21732

The benchmark test results of TripoSF

Chamfer Distance (CD) reduced by approximately 82%: Chamfer Distance is a standard metric for 3D reconstruction quality. It measures the average nearest-neighbor distance between points sampled on the reconstructed surface and points sampled on the ground-truth surface, so lower is better. The large reduction indicates that TripoSF reproduces surface detail more accurately.
F-score improved by approximately 88%: F-score is another metric for 3D reconstruction quality. It combines precision (the share of reconstructed points that lie close to the ground-truth surface) and recall (the share of ground-truth points that lie close to the reconstruction), so higher is better. The improvement indicates that TripoSF preserves detail while still capturing the overall structure of the model.
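
Both metrics are easy to state in code. The sketch below uses a common formulation on point sets: a symmetric Chamfer Distance built from mean nearest-neighbor distances, and an F-score at a distance threshold `tau`. The exact variant (squared versus unsquared distances, the threshold value, the number of sampled points) differs between papers, so treat this as an assumed definition and not necessarily the one used in the TripoSF evaluation. The brute-force nearest-neighbor search is only suitable for small point sets.

```python
import math


def nearest_distances(src, dst):
    """For each point in src, the distance to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance: mean pred->gt distance plus mean gt->pred distance."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    return sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt)


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d <= tau for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the third predicted point is displaced by 0.5 along z.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]

print(chamfer_distance(pred, gt))   # 1/6 + 1/6, about 0.333
print(f_score(pred, gt, tau=0.1))   # precision = recall = 2/3, so about 0.667
```

A percentage such as "reduced by 82%" is then a relative change between a baseline method's score and the new method's score on the same test shapes.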

Effect Comparison of TripoSF

[Figure: visual comparison of TripoSF results]

Application scenarios of TripoSF

Visual Effects (VFX): TripoSF can generate high-resolution and detailed 3D models, which are suitable for visual effects production in fields such as film and games.
Game Development: In game development, TripoSF can be used to generate high-quality 3D game assets, such as characters, environments, and props.
Embodied Intelligence: TripoSF has broad application prospects in the field of embodied intelligence and can be used for robot simulation and interaction.
Product Design: In the field of product design, TripoSF can be used for rapid prototyping and design verification. Designers can use TripoSF to generate high-resolution 3D models for detailed design evaluation and modification.