Aether – An Open-Source Generative World Model by Shanghai AI Lab
What is Aether?
Aether is an open-source generative world model developed by Shanghai AI Lab, trained entirely on synthetic data. It is the first to deeply integrate 3D spatiotemporal modeling with generative modeling, and it features three core capabilities: 4D dynamic reconstruction, action-conditioned video prediction, and goal-directed visual planning.
Aether perceives its environment, understands object positions and motion relationships, and makes decisions based on that understanding. Despite being trained only on simulated data, Aether demonstrates strong zero-shot generalization to the real world and handles complex tasks without any real-world fine-tuning. It offers powerful spatial reasoning and decision-making support for embodied AI systems.
Key Features of Aether
- 4D Dynamic Reconstruction: Reconstructs 3D scenes with both spatial and temporal information from video inputs, capturing dynamic changes over time.
- Action-Conditioned Video Prediction: Predicts future scene changes based on initial observations and action trajectories.
- Goal-Directed Visual Planning: Generates plausible paths between initial and goal states to help intelligent systems plan their actions. (A sketch after this list shows how the three tasks map to different conditioning inputs.)
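Internally, these three capabilities can be understood as one conditional generation problem in which different conditioning signals are supplied. The sketch below illustrates that mapping only; the class and function names are hypothetical and do not come from the Aether codebase.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class Conditioning:
    """Hypothetical bundle of conditioning signals for a world model."""
    observation: np.ndarray               # observed RGB frames, always required
    actions: Optional[np.ndarray] = None  # camera-trajectory actions (e.g. ray maps)
    goal: Optional[np.ndarray] = None     # goal frame for planning

def select_task(cond: Conditioning) -> str:
    """Map the supplied conditions to one of Aether's three tasks.

    - observation only      -> 4D dynamic reconstruction
    - observation + actions -> action-conditioned video prediction
    - observation + goal    -> goal-directed visual planning
    """
    if cond.actions is not None:
        return "action_conditioned_prediction"
    if cond.goal is not None:
        return "goal_directed_planning"
    return "4d_reconstruction"

# Example: supplying an observation plus a goal frame selects the planning task.
obs = np.zeros((8, 480, 640, 3), dtype=np.float32)  # dummy video clip
goal = np.zeros((480, 640, 3), dtype=np.float32)    # dummy goal frame
print(select_task(Conditioning(observation=obs, goal=goal)))
# -> goal_directed_planning
```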
Technical Principles of Aether
- Unified Multi-Task Framework: Combines dynamic reconstruction, video prediction, and action planning within a single framework. Through interleaved task feature learning, it enables joint optimization across tasks, improving the model's robustness and stability.
- Geometry-Aware Modeling: Incorporates 3D spatiotemporal modeling to strengthen spatial reasoning. Training relies on large-scale simulated RGB-D data (color images plus depth maps), backed by a complete data-cleaning and dynamic-reconstruction pipeline and richly annotated action sequences.
- Camera Trajectories as Action Representation: Uses camera trajectories as a global action representation. In navigation tasks these correspond directly to the navigation path; in robotic manipulation, the motion of the handheld camera captures the 6-DoF movement of the end-effector.
- Diffusion Models and Multimodal Fusion: Fine-tunes pretrained video diffusion models on synthetic 4D data. Depth videos are converted into scale-invariant normalized disparity representations, and camera trajectories are encoded as scale-invariant ray-map sequences aligned with the spatiotemporal frames of Diffusion Transformers (DiTs); sketches of both encodings follow this list. Aether integrates conditional signals across tasks and modalities for effective multimodal fusion and joint optimization.
- Zero-Shot Generalization: Trained entirely on synthetic data, Aether transfers to real-world tasks without a single real-world sample. By combining conditional inputs such as observation frames, goal frames, and action trajectories within the diffusion process, it unifies modeling and generation across diverse tasks and performs well in real scenarios.
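As a concrete illustration of the depth encoding mentioned above, here is a minimal sketch of turning a depth video into scale-invariant normalized disparity. The per-clip quantile scaling is an assumption chosen for illustration; the exact normalization is defined in the Aether paper.

```python
import numpy as np

def depth_to_normalized_disparity(depth: np.ndarray, eps: float = 1e-6,
                                  quantile: float = 0.98) -> np.ndarray:
    """Convert a depth video (T, H, W) in meters to scale-invariant disparity.

    Disparity is the reciprocal of depth; dividing by a per-clip quantile
    removes the global scene scale, so clips of large and small scenes land
    in a comparable range. NOTE: the quantile-based normalization here is an
    illustrative assumption, not Aether's exact formula.
    """
    disparity = 1.0 / np.maximum(depth, eps)   # invert depth
    scale = np.quantile(disparity, quantile)   # per-clip statistic
    return disparity / max(scale, eps)         # scale-invariant disparity

# Example: a synthetic depth clip with depths between 0.5 m and 20 m.
rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 20.0, size=(8, 64, 64)).astype(np.float32)
norm_disp = depth_to_normalized_disparity(depth)
print(norm_disp.min(), norm_disp.max())  # most values fall below 1
```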
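The ray-map action encoding can be sketched in the same spirit: every pixel of every frame is assigned a ray origin and direction derived from the camera intrinsics and pose. The construction below is the standard one; Aether's scale-invariance scheme and channel layout may differ.

```python
import numpy as np

def raymap(K: np.ndarray, c2w: np.ndarray, H: int, W: int) -> np.ndarray:
    """Build an (H, W, 6) ray map: 3 channels of ray origin, 3 of direction.

    K   : (3, 3) camera intrinsics.
    c2w : (4, 4) camera-to-world pose for this frame.
    This is a generic construction; Aether additionally normalizes the map
    to be scale-invariant, which is not reproduced here.
    """
    # Pixel grid in homogeneous image coordinates (u, v, 1), pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)       # (H, W, 3)

    dirs_cam = pix @ np.linalg.inv(K).T                    # camera-space directions
    dirs_world = dirs_cam @ c2w[:3, :3].T                  # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape) # camera center per pixel
    return np.concatenate([origin, dirs_world], axis=-1)   # (H, W, 6)

# Example: identity pose with simple pinhole intrinsics.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
c2w = np.eye(4)
rays = raymap(K, c2w, H=64, W=64)
print(rays.shape)  # (64, 64, 6)
```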
Project Resources
- Official Website: https://aether-world.github.io/
- GitHub Repository: https://github.com/OpenRobotLab/Aether
- HuggingFace Model Hub: https://huggingface.co/AetherWorldModel/AetherV1
- arXiv Technical Paper: https://arxiv.org/pdf/2503.18945
- Live Demo: https://huggingface.co/spaces/AmberHeart/AetherV1
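For local experiments, the released checkpoint can be fetched from the HuggingFace Hub. The snippet below only downloads the weights; for actual inference, follow the instructions in the GitHub repository linked above.

```python
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download the AetherV1 checkpoint to the local HF cache and print its path.
# Inference code and usage examples live in the OpenRobotLab/Aether repo.
local_dir = snapshot_download(repo_id="AetherWorldModel/AetherV1")
print(f"AetherV1 weights downloaded to: {local_dir}")
```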
Application Scenarios for Aether
- Robot Navigation: Helps robots plan paths and avoid dynamic obstacles.
- Autonomous Driving: Enables real-time road-scene reconstruction and traffic prediction.
- Virtual Reality: Generates immersive virtual scenes to enhance user experiences.
- Industrial Robotics: Optimizes robot motion paths, improving manufacturing efficiency.
- Smart Surveillance: Analyzes surveillance footage and predicts abnormal behavior.