EX-4D – A 4D video generation framework developed by ByteDance’s Pico team
What is EX-4D?
EX-4D is a 4D video generation framework developed by Pico, a team under ByteDance. It generates high-quality 4D videos from a single monocular video, even at extreme viewpoints. The framework is built on a Depth Watertight Mesh (DW-Mesh) representation that explicitly models both visible and occluded regions, which keeps geometry consistent under extreme camera poses. A simulated occlusion mask strategy produces effective training data from monocular videos alone, and a lightweight LoRA-based video diffusion adapter synthesizes physically consistent, temporally coherent videos. The authors report that EX-4D significantly outperforms existing methods at extreme viewpoints, offering a new approach to 4D video generation.
Main Features of EX-4D
- Extreme Viewpoint Video Generation: Generates videos at extreme camera angles ranging from -90° to 90°, providing a rich viewing experience.
- Geometric Consistency Preservation: Built on the Depth Watertight Mesh (DW-Mesh) representation, it keeps the geometric structure consistent across viewpoints.
- Occlusion Handling: Handles boundary occlusions explicitly, avoiding the visual artifacts that viewpoint changes would otherwise cause.
- Temporal Coherence: Generated videos stay temporally coherent, avoiding common flickering and jittering.
- No Multi-View Data Required: Uses a simulated occlusion mask strategy to train on monocular videos, eliminating the need for costly multi-view datasets.
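To make the occlusion handling above more concrete, here is a minimal sketch of one way a depth map can be turned into a gap-free grid mesh whose faces are flagged when they span a depth discontinuity. It is an illustration under stated assumptions, not the EX-4D implementation: the function name `depth_to_mesh`, the pinhole intrinsics `fx, fy, cx, cy`, and the `rel_jump` threshold are all hypothetical, and the actual DW-Mesh construction may use different criteria.

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy, rel_jump=0.1):
    """Build a grid mesh from a depth map and flag faces that span depth jumps.

    Hypothetical helper for illustration only. Every 2x2 pixel cell becomes
    two triangles, so the surface has no holes; faces that bridge a large
    depth gap are kept but marked as occluded instead of being removed.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]

    # Unproject each pixel to a 3D point with a pinhole camera model.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Vertex indices of the four corners of every pixel cell.
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]
    bl, br = idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([
        np.stack([tl, bl, tr], axis=-1).reshape(-1, 3),
        np.stack([tr, bl, br], axis=-1).reshape(-1, 3),
    ])

    # A face is flagged as occluded when its depth range is large relative
    # to its nearest vertex (assumed criterion, chosen for simplicity).
    z = vertices[faces, 2]
    jump = z.max(axis=1) - z.min(axis=1)
    occluded = jump > rel_jump * z.min(axis=1)
    return vertices, faces, occluded


if __name__ == "__main__":
    # Toy scene: near surface on the left half, far surface on the right.
    depth = np.full((4, 6), 5.0)
    depth[:, :3] = 1.0
    vertices, faces, occluded = depth_to_mesh(depth, 100.0, 100.0, 3.0, 2.0)
    print(len(faces), "faces,", int(occluded.sum()), "flagged as occluded")
```

Rendering such a mesh from a new camera pose and collecting the pixels covered by flagged faces would give a per-view occlusion mask, which is the kind of signal the simulated occlusion mask strategy relies on.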
Technical Principles of EX-4D
- Depth Watertight Mesh (DW-Mesh): DW-Mesh models visible surfaces and also explicitly models occluded boundaries, ensuring geometric consistency under extreme viewpoints. It provides a reliable occlusion mask for each viewpoint, so boundary occlusions can be handled effectively.
- Simulated Occlusion Mask Strategy: Using DW-Mesh, the framework simulates the occlusions that would appear from novel viewpoints and uses them to generate effective training data. Inter-frame point tracking keeps the masks temporally consistent, mimicking realistic occlusion changes in dynamic scenes.
- Lightweight LoRA-based Video Diffusion Adapter: Efficiently integrates the geometric information from DW-Mesh into a pre-trained video diffusion model to generate high-quality videos. With only about 1% of the parameters trainable, it greatly reduces computational requirements and makes training and inference more efficient.
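The "about 1% trainable parameters" figure follows from how LoRA works in general: the pre-trained weight matrix stays frozen and only a low-rank update is trained. The sketch below shows this for a single linear layer in NumPy. The class name `LoRALinear`, the layer width of 1536, and the rank of 8 are assumptions chosen for illustration; they are not taken from EX-4D, and the sketch does not cover how the adapter consumes DW-Mesh renders or masks.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha / r) * B @ A.

    Illustrative only: sizes and initialization are assumptions, not the
    configuration used by EX-4D.
    """

    def __init__(self, d_in, d_out, rank=8, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Pre-trained weight: kept frozen during adapter training.
        self.weight = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Low-rank factors: the only trainable parameters.
        self.lora_a = rng.standard_normal((rank, d_in)) * 0.01
        # B starts at zero so the adapter is a no-op before training.
        self.lora_b = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return base + self.scale * update

    def trainable_fraction(self):
        trainable = self.lora_a.size + self.lora_b.size
        return trainable / (trainable + self.weight.size)


if __name__ == "__main__":
    layer = LoRALinear(d_in=1536, d_out=1536, rank=8)
    print(f"trainable fraction: {layer.trainable_fraction():.2%}")
```

For a square layer of width d and rank r, the trainable share is 2r / (d + 2r), so small ranks on wide layers land naturally in the region of one percent.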
Project Links
- Official website: https://tau-yihouxiang.github.io/projects/EX-4D/EX-4D.html
- GitHub repository: https://github.com/tau-yihouxiang/EX-4D
- arXiv paper: https://arxiv.org/pdf/2506.05554
Applications of EX-4D
- Immersive Entertainment Experiences: Used in sports events, concerts, and live broadcasts, allowing viewers to freely switch perspectives and enhancing engagement.
- Game Development: Generates free-viewpoint game scenes and cutscenes, improving player immersion and interaction.
- Education and Training: Creates virtual teaching environments, such as virtual labs and surgical simulations, enhancing learning outcomes.
- Advertising and Marketing: Produces interactive ads and virtual showrooms, enabling consumers to view products from all angles and improving shopping experiences.
- Cultural Heritage Preservation: Recreates historical scenes and builds virtual museums, allowing multi-angle appreciation of cultural artifacts and artworks.