Depth Anything 3 – A Visual Spatial Reconstruction Model Developed by ByteDance
What is Depth Anything 3?
Depth Anything 3 (DA3) is a visual spatial reconstruction model developed by ByteDance’s Seed team. It reconstructs 3D geometric structure from visual inputs captured at arbitrary viewpoints using a single, unified Transformer architecture. The model introduces a “depth-ray” representation that eliminates the need for complex multi-task training, greatly simplifying model design. DA3 surpasses previous mainstream models in both camera pose accuracy and geometric reconstruction quality while maintaining high inference efficiency. It is well suited to autonomous driving, robot navigation, virtual reality, and similar applications, offering an efficient new solution for visual spatial reconstruction.

Key Features of Depth Anything 3
Multi-view 3D Spatial Reconstruction
DA3 can reconstruct 3D spatial structure from any number of visual inputs, including single images, multi-view photos, or video streams.
Camera Pose Estimation
The model can accurately estimate camera pose—including position and orientation—even without known camera parameters.
Monocular Depth Estimation
DA3 delivers strong performance in monocular depth estimation, predicting pixel-level depth from a single image to support 3D scene understanding.
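As an illustration, below is a minimal single-image inference sketch using the Hugging Face `transformers` depth-estimation pipeline. The checkpoint name is a placeholder assumption, not a confirmed release; consult the project’s GitHub repository for the officially published weights and loading instructions.

```python
from transformers import pipeline
from PIL import Image

# Hedged sketch of monocular depth inference via the Hugging Face
# depth-estimation pipeline. The model id below is a HYPOTHETICAL
# placeholder — check the DA3 repository for the real checkpoint name.
depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/depth-anything-3",  # placeholder model id
)

image = Image.open("room.jpg")            # any single RGB image
result = depth_estimator(image)
result["depth"].save("room_depth.png")    # predicted depth map as an image
```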
Novel View Synthesis
By integrating with 3D Gaussian splatting, the model can render high-quality images from novel viewpoints, making it suitable for VR/AR view-synthesis tasks.
Efficient Inference and Deployment
Its streamlined architecture enables fast inference with low resource consumption, supporting deployment on mobile and embedded devices.
Technical Principles of Depth Anything 3
Unified Transformer Architecture
DA3 uses a single plain Transformer (e.g., a pretrained DINOv2 vision transformer) as its core architecture. Without complex custom modules, the Transformer’s self-attention mechanism flexibly handles any number of input views and dynamically exchanges cross-view information for efficient global spatial modeling.
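The sketch below illustrates this idea under stated assumptions: patch tokens from all views are concatenated into one sequence and processed by a single plain Transformer encoder, so self-attention mixes information across views without any view-specific modules. Hyperparameters and module choices are illustrative, not DA3’s actual configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the official implementation): one plain transformer
# encoder processes patch tokens from all input views jointly.
class UnifiedViewEncoder(nn.Module):
    def __init__(self, dim=384, depth=4, heads=6, patch=14):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, views):  # views: (V, 3, H, W) — any number of views V
        tokens = self.patch_embed(views).flatten(2).transpose(1, 2)  # (V, N, dim)
        v, n, d = tokens.shape
        tokens = tokens.reshape(1, v * n, d)  # one joint sequence over all views
        fused = self.encoder(tokens)          # self-attention mixes cross-view info
        return fused.reshape(v, n, d)

feats = UnifiedViewEncoder()(torch.randn(3, 3, 224, 224))  # 3 input views
print(feats.shape)  # torch.Size([3, 256, 384])
```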
Depth-Ray Representation
The model introduces a “depth-ray” representation that predicts a depth map and a ray map to describe the 3D scene. The depth map gives pixel-to-camera distance, while the ray map describes the projection direction of each pixel in 3D space. This representation naturally decouples scene geometry from camera motion, simplifying outputs and improving accuracy and efficiency.
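To make the representation concrete, here is a minimal sketch of how a depth map and a ray map combine into a 3D point cloud: each pixel’s 3D point is its depth times its ray direction. The pinhole setup, dimensions, and names are illustrative assumptions, not DA3’s output format.

```python
import numpy as np

# Depth-ray sketch: a ray map gives each pixel a camera-space direction,
# a depth map gives the distance along that ray; their product is a 3D point.
H, W, f = 4, 6, 2.0  # tiny image and focal length, for demonstration only

# Build a ray map from pinhole intrinsics: one unit direction per pixel.
u, v = np.meshgrid(np.arange(W) - W / 2 + 0.5, np.arange(H) - H / 2 + 0.5)
rays = np.stack([u / f, v / f, np.ones_like(u)], axis=-1)  # (H, W, 3)
rays /= np.linalg.norm(rays, axis=-1, keepdims=True)       # normalize directions

depth = np.full((H, W), 5.0)          # (H, W) per-pixel distances

points = depth[..., None] * rays      # (H, W, 3) camera-space point cloud
print(points.reshape(-1, 3).shape)    # (24, 3)
```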
Input-Adaptive Cross-View Attention
DA3 incorporates an input-adaptive cross-view self-attention mechanism that dynamically reorders input view tokens to enable efficient cross-view information exchange. This allows the model to handle a wide range of input configurations, from single-view to multi-view setups.
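One common way to realize such cross-view exchange is to alternate attention over two token orderings, which the sketch below illustrates: attention within each view, then a reordering so attention runs across views at each token position. This is a simplified stand-in; DA3’s actual reordering scheme may differ in detail.

```python
import torch
import torch.nn as nn

# Illustrative sketch of alternating within-view / cross-view attention
# implemented by reordering the token tensor between attention calls.
class AlternatingAttention(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.within = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                # x: (V, N, D) — V views, N tokens each
        v, n, d = x.shape
        x = x + self.within(x, x, x)[0]  # attention among tokens of each view
        x = x.transpose(0, 1)            # reorder: (N, V, D) groups the same
        x = x + self.cross(x, x, x)[0]   #   token position across all views
        return x.transpose(0, 1)         # restore (V, N, D)

out = AlternatingAttention()(torch.randn(3, 256, 384))
print(out.shape)  # torch.Size([3, 256, 384])
```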
Dual DPT Heads
To jointly predict depth and ray maps, DA3 employs a dual DPT-head design. The two heads share feature-processing modules and optimize depth and ray outputs independently in the final stage, enhancing interaction and consistency between the two tasks.
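The sketch below captures the dual-head idea in simplified form: a shared feature trunk feeds two separate final projections, one producing a 1-channel depth map and one a 3-channel ray map. It compresses a DPT-style head into a few convolutions, so treat it as a schematic rather than DA3’s exact design.

```python
import torch
import torch.nn as nn

# Simplified dual-head sketch: shared processing, separate final outputs.
class DualHead(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        self.shared = nn.Sequential(            # shared feature-processing trunk
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
        )
        self.depth_head = nn.Conv2d(dim, 1, 1)  # per-pixel depth
        self.ray_head = nn.Conv2d(dim, 3, 1)    # per-pixel ray direction

    def forward(self, feat):                    # feat: (B, dim, H, W)
        f = self.shared(feat)                   # both tasks see the same features
        return self.depth_head(f), self.ray_head(f)

depth, rays = DualHead()(torch.randn(1, 384, 16, 16))
print(depth.shape, rays.shape)  # (1, 1, 16, 16) (1, 3, 16, 16)
```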
Teacher–Student Training Paradigm
DA3 uses a teacher–student training paradigm, where a teacher model trained on synthetic data generates high-quality pseudo-labels to provide better supervision signals for the student model.
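A minimal sketch of this paradigm appears below: a frozen teacher predicts depth on unlabeled images, and the student regresses against those pseudo-labels. The stand-in networks, L1 loss, and training-step structure are illustrative assumptions, not DA3’s actual training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def pseudo_labels(teacher, images):
    teacher.eval()
    return teacher(images)              # (B, 1, H, W) depth pseudo-labels

def distill_step(student, teacher, images, optimizer):
    target = pseudo_labels(teacher, images)  # teacher supervision signal
    pred = student(images)
    loss = F.l1_loss(pred, target)           # simple L1 regression on depth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

teacher = nn.Conv2d(3, 1, 3, padding=1)  # stand-in networks for illustration
student = nn.Conv2d(3, 1, 3, padding=1)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
print(distill_step(student, teacher, torch.randn(2, 3, 32, 32), opt))
```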
Single-Step High-Accuracy Output
DA3 generates accurate depth and ray outputs in one forward pass, without the iterative optimization used in traditional methods. This greatly increases inference speed, simplifies training and deployment, and maintains high precision in 3D reconstruction.
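The end-to-end sketch below makes the single-pass property explicit: one forward call maps any number of input views directly to depth and ray predictions, with no test-time refinement loop. Every module here is a toy stand-in, not the actual DA3 network.

```python
import torch
import torch.nn as nn

# Toy end-to-end sketch: views in, depth and ray maps out, in one pass.
class ToyDA3(nn.Module):
    def __init__(self, dim=64, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, 2)
        self.depth_head = nn.Linear(dim, patch * patch)    # depth per patch pixel
        self.ray_head = nn.Linear(dim, 3 * patch * patch)  # 3D ray per patch pixel

    def forward(self, views):                              # views: (V, 3, H, W)
        t = self.embed(views).flatten(2).transpose(1, 2)   # (V, N, dim)
        t = self.encoder(t.reshape(1, -1, t.shape[-1])).reshape(t.shape)
        return self.depth_head(t), self.ray_head(t)        # single forward pass

depth, rays = ToyDA3()(torch.randn(2, 3, 64, 64))          # 2 views
print(depth.shape, rays.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 16, 768])
```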
Project Resources
- Project Website: https://depth-anything-3.github.io/
- GitHub Repository: https://github.com/ByteDance-Seed/depth-anything-3
- arXiv Paper: https://arxiv.org/pdf/2511.10647
- Online Demo: https://huggingface.co/spaces/depth-anything/depth-anything-3
Application Scenarios of Depth Anything 3
Autonomous Driving
DA3 reconstructs 3D environments from multi-view camera images, helping autonomous vehicles more accurately perceive object positions and distances, improving decision reliability and safety.
Robotic Navigation
By reconstructing real-time 3D structure, DA3 provides robots with precise terrain and obstacle information for efficient navigation and path planning in complex environments.
Virtual Reality (VR) & Augmented Reality (AR)
DA3 can rapidly convert real-world scenes into high-quality 3D models for VR scene reconstruction or AR virtual object integration, enhancing immersion.
Architectural Mapping & Design
DA3 reconstructs detailed 3D point clouds of architectural scenes from multi-angle images, supporting architectural surveying, interior design, and virtual walkthroughs.
Cultural Heritage Preservation
DA3 enables high-fidelity 3D reconstruction of historical buildings and artifacts, supporting digital preservation, restoration research, and virtual exhibition.