CombatVLA – a 3D action game–specific VLA model launched by Taotian Group
What is CombatVLA?
CombatVLA is an efficient Vision-Language-Action (VLA) model developed by the Future Life Laboratory team at Taotian Group, designed specifically for combat tasks in 3D action role-playing games (ARPGs). The 3B-parameter model is trained on video-action pairs collected with a motion tracker, with the data formatted as Action-of-Thought (AoT) sequences. A three-stage progressive learning paradigm (video-level, then frame-level, then truncated frame-level fine-tuning) lets the model reason efficiently at inference time. In the reported results, CombatVLA outperforms existing models on a combat understanding benchmark, delivers a 50x inference speedup over traditional models, and achieves a higher task success rate than human players.
Key Features
- Efficient combat decision-making: Makes real-time combat decisions in complex 3D game environments, such as dodging attacks, casting skills, or restoring health, with decision speeds up to 50x faster than traditional models.
- Combat understanding and reasoning: Evaluates enemy states, predicts enemy attack intentions, and reasons about the best combat action, significantly outperforming other models in combat understanding.
- Action command generation: Outputs executable keyboard and mouse instructions (e.g., pressing specific keys or performing mouse actions) to control in-game characters.
- Generalization ability: Generalizes well across task difficulties and across games, executing combat tasks effectively even in unseen game scenarios.
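As a rough illustration of the action command generation feature, the sketch below maps a comma-separated list of action names to keyboard and mouse events. The action names, the key bindings, the output format, and the `InputEvent` type are all assumptions made for this example; the source does not specify the model's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    device: str    # "keyboard" or "mouse"
    control: str   # key name or mouse button
    hold_s: float  # how long the control is held, in seconds


# Hypothetical bindings; a real game would define its own.
KEY_BINDINGS = {
    "dodge": InputEvent("keyboard", "space", 0.05),
    "light_attack": InputEvent("mouse", "left", 0.05),
    "heavy_attack": InputEvent("mouse", "right", 0.60),
    "heal": InputEvent("keyboard", "r", 0.05),
}


def actions_to_events(action_text: str) -> list[InputEvent]:
    """Translate model output such as "dodge, light_attack" into input events.

    Unknown or empty action names are skipped rather than raising, so a
    malformed prediction never blocks the control loop.
    """
    events = []
    for name in action_text.split(","):
        event = KEY_BINDINGS.get(name.strip().lower())
        if event is not None:
            events.append(event)
    return events


if __name__ == "__main__":
    for event in actions_to_events("dodge, light_attack"):
        print(event)
```

Skipping unknown actions is a design choice for this sketch; a stricter system might log or reject them instead.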
Technical Principles of CombatVLA
- Motion tracker: Collects human player operation data (keyboard and mouse inputs) synchronized with in-game visuals to generate video-action pair datasets.
- Action-of-Thought (AoT) sequences: Converts collected data into AoT sequences, where each action is paired with a detailed explanation to help the model understand the semantics and logic of actions.
- Three-stage progressive learning:
  - Stage 1: Video-level AoT fine-tuning for initial understanding of the combat environment.
  - Stage 2: Frame-level AoT fine-tuning to ensure strict alignment between actions and preceding frames.
  - Stage 3: Frame-level truncated AoT fine-tuning, introducing a special <TRUNC> token to truncate outputs for faster inference.
- Adaptive action-weighted loss: Combines action-alignment loss and modality-contrastive loss to optimize training and ensure accurate action output.
- Action execution framework: Converts model-generated action instructions into actual keyboard and mouse operations to control game characters automatically.
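The data pipeline and the truncation idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the timestamp-based pairing rule, and the layout "actions, then <TRUNC>, then explanation" are hypothetical. The only detail taken from the description is that a special <TRUNC> token lets output be cut short at inference time.

```python
import bisect
from typing import Iterable

TRUNC = "<TRUNC>"


def pair_actions_with_frames(
    frame_times: list[float], events: list[tuple[float, str]]
) -> dict[int, list[str]]:
    """Attach each logged input to the latest frame at or before its timestamp.

    frame_times must be sorted ascending; events are (timestamp, action) pairs.
    Events logged before the first frame are dropped.
    """
    pairs: dict[int, list[str]] = {}
    for timestamp, action in events:
        frame_idx = bisect.bisect_right(frame_times, timestamp) - 1
        if frame_idx >= 0:
            pairs.setdefault(frame_idx, []).append(action)
    return pairs


def format_aot(actions: list[str], explanation: str, truncated: bool) -> str:
    """Build one AoT-style target: actions first, then the explanation.

    With truncated=True a <TRUNC> marker separates the two parts (assumed layout).
    """
    action_part = ", ".join(actions)
    if truncated:
        return f"{action_part} {TRUNC} {explanation}"
    return f"{action_part} {explanation}"


def decode_until_trunc(token_stream: Iterable[str]) -> list[str]:
    """Consume generated tokens and stop at <TRUNC>, skipping the explanation."""
    kept = []
    for token in token_stream:
        if token == TRUNC:
            break
        kept.append(token)
    return kept


if __name__ == "__main__":
    frames = [0.0, 0.1, 0.2]
    log = [(0.05, "dodge"), (0.1, "attack"), (0.25, "heal")]
    print(pair_actions_with_frames(frames, log))
    print(format_aot(["dodge"], "enemy winds up", truncated=True))
    print(decode_until_trunc(["dodge", TRUNC, "enemy", "winds", "up"]))
```

Stopping at the marker means the explanation tokens are never generated, which is where the inference savings would come from under this assumed layout.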
Project Links
- Official Website: https://combatvla.github.io/
- GitHub Repository: https://github.com/ChenVoid/CombatVLA
- arXiv Paper: https://arxiv.org/pdf/2503.09527
Application Scenarios of CombatVLA
- 3D ARPG gameplay: Real-time control of game characters during combat, enabling efficient decision-making and action execution for enhanced gameplay.
- Game testing and optimization: Assists developers in testing combat systems, identifying issues, and optimizing game mechanics.
- Esports training: Provides intelligent opponents for professional players, supporting practice in combat strategies and skill improvement.
- Game content creation: Helps developers generate combat scenarios and narratives, accelerating the construction of complex levels and missions.
- Robotics control: Extends to real-world robotics, enabling robots to make rapid decisions and execute actions in dynamic environments.