VPP – The First AIGC Robot Large Model Launched Jointly by Tsinghua University and Star Motion Era
What is VPP?
VPP (Video Prediction Policy) is the first AIGC (AI-Generated Content) large model for robotics, developed jointly by Tsinghua University and Star Motion Era. Built on a pre-trained video diffusion model, it learns from vast amounts of internet video data to predict future scenes directly and generate robot actions. VPP lets robots foresee future scenarios, run prediction and control at high frequency, and transfer across different humanoid robot bodies, greatly reducing reliance on high-quality real-world robot data.
VPP achieved near-perfect scores on the CALVIN ABC-D benchmark and has demonstrated strong performance on complex real-world dexterous manipulation tasks. Its open-source release provides solid technical support for advancing embodied intelligence in robotics.
Key Features of VPP
- Future Scene Prediction: Allows robots to “see” into the future before taking action, enhancing generalization ability.
- High-Frequency Prediction and Action Execution: Achieves prediction frequencies of 6–10 Hz and control frequencies above 50 Hz, making motions smoother (a sketch of this split-rate loop follows this list).
- Cross-Embodiment Learning: Learns from video data of different robot forms, including human demonstrations, lowering the cost of data collection.
- Multi-Task Learning and Generalization: Performs well on complex real-world tasks such as grasping, placing, stacking, pouring, and tool use.
- Interpretability and Debugging: Predicted videos reveal likely failure scenarios in advance, supporting targeted development and optimization.
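To make the two rates above concrete, here is a minimal sketch of such a split-rate loop in Python. The function names (predict_action_chunk, send_joint_command), the chunk length, and the threading layout are illustrative assumptions rather than VPP's actual interfaces: a slow thread refreshes a short chunk of predicted actions at roughly 10 Hz, while a fast thread streams individual commands to the robot at 50 Hz.

```python
# Sketch of a split-rate loop (hypothetical interfaces, not VPP's actual API):
# a slow thread refreshes a predicted action chunk at ~10 Hz while a fast loop
# streams individual joint commands at ~50 Hz.
import threading
import time
import numpy as np

PREDICTION_HZ = 10   # action-chunk refresh rate (paper reports 6-10 Hz prediction)
CONTROL_HZ = 50      # low-level command rate (paper reports >50 Hz control)
CHUNK_LEN = 16       # actions per predicted chunk (assumed value)

latest_chunk = np.zeros((CHUNK_LEN, 7))   # e.g. 7-DoF arm targets (placeholder)
chunk_lock = threading.Lock()

def predict_action_chunk(observation):
    """Placeholder for the policy: observation -> short sequence of future actions."""
    return np.random.randn(CHUNK_LEN, 7) * 0.01

def send_joint_command(action):
    """Placeholder for the robot driver."""
    pass

def prediction_loop():
    """Slow loop: refresh the shared action chunk at the prediction rate."""
    global latest_chunk
    while True:
        observation = None  # would come from cameras / proprioception
        chunk = predict_action_chunk(observation)
        with chunk_lock:
            latest_chunk = chunk
        time.sleep(1.0 / PREDICTION_HZ)

def control_loop(num_steps=200):
    """Fast loop: step through the most recent chunk at the control rate."""
    for step in range(num_steps):
        with chunk_lock:
            chunk = latest_chunk
        action = chunk[step % CHUNK_LEN]  # a real system would re-index on each new chunk
        send_joint_command(action)
        time.sleep(1.0 / CONTROL_HZ)

threading.Thread(target=prediction_loop, daemon=True).start()
control_loop()
```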
Technical Principles of VPP
- Predictive Visual Representations from Video Diffusion Models (VDMs): VPP builds on pre-trained video diffusion models such as Stable Video Diffusion. Rather than generating a complete video, it takes the features from a single denoising forward pass as a predictive visual representation, which covers the current frame and explicitly encodes information about future frames (a conceptual sketch of the full pipeline follows this list).
- Action Learning: A Video Former aggregates the predictive visual representations and extracts their spatiotemporal information; a diffusion policy then generates robot actions, providing a seamless transition from prediction to execution.
- Optimization and Generalization: VPP is trained on a mix of internet video and robot interaction data, reducing reliance on high-quality real-robot data. Through cross-embodiment learning it learns directly from varied robot morphologies, further improving generalization.
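Below is a conceptual sketch of how these stages could fit together in PyTorch. The module names, tensor shapes, and placeholder backbone are assumptions for illustration only; the actual implementation lives in the GitHub repository linked in the next section. A video diffusion backbone is run for a single denoising forward pass to produce predictive features, a Video Former condenses them into a fixed set of tokens, and a diffusion-policy-style head maps those tokens plus a noisy action chunk to denoised actions.

```python
# Conceptual sketch of the VPP pipeline (module names, shapes, and the placeholder
# backbone are assumptions, not the released code):
# single-step VDM features -> Video Former -> diffusion policy action head.
import torch
import torch.nn as nn

class SingleStepVDMFeatures(nn.Module):
    """Stand-in for a pre-trained video diffusion model (e.g. Stable Video Diffusion).
    Instead of fully generating a video, one denoising forward pass is run and the
    intermediate features are kept; per the paper, these already encode future frames."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)  # placeholder for the U-Net

    def forward(self, frames):
        # frames: (B, 3, T, H, W) conditioning observation
        feats = self.backbone(frames)            # one forward (denoising) step
        return feats.flatten(2).transpose(1, 2)  # (B, tokens, feat_dim)

class VideoFormer(nn.Module):
    """Aggregates spatiotemporal predictive features into a fixed number of query tokens."""
    def __init__(self, feat_dim=512, num_queries=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, feats):
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)
        return tokens                            # (B, num_queries, feat_dim)

class DiffusionPolicyHead(nn.Module):
    """Simplified action head: denoises a noisy action chunk conditioned on the tokens.
    A real diffusion policy iterates this step over a full noise schedule."""
    def __init__(self, feat_dim=512, act_dim=7, chunk_len=16):
        super().__init__()
        self.chunk_len = chunk_len
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, tokens, noisy_actions):
        cond = tokens.mean(dim=1, keepdim=True).expand(-1, self.chunk_len, -1)
        return self.net(torch.cat([cond, noisy_actions], dim=-1))

# Wiring the three stages together on a dummy observation.
vdm, former, head = SingleStepVDMFeatures(), VideoFormer(), DiffusionPolicyHead()
obs = torch.randn(1, 3, 8, 16, 16)               # (B, C, T, H, W)
tokens = former(vdm(obs))
noisy_actions = torch.randn(1, 16, 7)
denoised_actions = head(tokens, noisy_actions)
print(denoised_actions.shape)                    # torch.Size([1, 16, 7])
```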
Project Links for VPP
- Official Website: https://video-prediction-policy.github.io/
- GitHub Repository: https://github.com/roboterax/video-prediction-policy
- arXiv Technical Paper: https://arxiv.org/pdf/2412.14803
Application Scenarios for VPP
- Home Services: Household tasks such as pouring water and fetching items, plus assistance in elderly or child care (e.g., item delivery).
- Industrial Manufacturing: Part grasping, cargo handling, and stacking to improve production efficiency.
- Medical Assistance: Handing over surgical tools, supporting rehabilitation training, and handling hospital logistics.
- Education and Research: Helping students understand complex manipulation processes and supporting laboratory experiments.
- Service Industry: Food delivery in restaurants, luggage handling in hotels, and guidance in public spaces.