MineWorld – An Open-Source Real-Time Interactive World Model by Microsoft Research

What is MineWorld

MineWorld is an open-source, real-time interactive world model for Minecraft developed by Microsoft Research. It converts game scenes and player actions into discrete token IDs and trains a visual-action autoregressive Transformer on them through next-token prediction. A parallel decoding algorithm lets the model generate 4 to 7 frames per second, which is fast enough for real-time interaction. The authors report that MineWorld surpasses existing models such as Oasis in video quality, controllability, and inference speed.

Key Features of MineWorld

High Generation Quality: Utilizing a visual-action autoregressive Transformer, MineWorld can generate coherent and high-fidelity game frames based on visual inputs and actions.
Strong Controllability: In benchmark tests of action-following capability, the model behaves precisely and consistently, generating game scenes that match the input actions.
Fast Inference Speed: The parallel decoding algorithm lets the model generate 4 to 7 frames per second, supporting real-time interaction.
Functioning as a Game Agent: Because MineWorld is trained to predict both game states and actions, it can also choose its own actions and operate as an autonomous game agent.
Real-Time Interaction Capability: Users can interact with the model in real time through web demonstrations or local execution, selecting initial frames, controlling camera movement, and performing game actions.

Technical Principles of MineWorld

Visual-Action Autoregressive Transformer: MineWorld achieves joint modeling of visuals and actions by converting game scenes and player actions into discrete token sequences. Specifically:
- Visual Tokenizer: Using a VQ-VAE architecture, it encodes each game scene as discrete visual tokens. The tokenizer starts from a pre-trained checkpoint and is fine-tuned on the Minecraft dataset to achieve high-quality image reconstruction.
- Action Tokenizer: Continuous player actions (e.g., mouse movements) are quantized into discrete tokens, while discrete actions (e.g., moving forward, attacking) are categorized into different classes, each represented by a unique token.
- Transformer Decoder: Based on the LLaMA architecture, it receives interleaved sequences of visual and action tokens as input and is trained through next-token prediction. The decoder learns rich representations of game states and the conditional relationships between states and actions.
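As a rough illustration of the tokenization described above, the following sketch quantizes continuous camera movement into bins, maps key presses to class tokens, and interleaves per-frame visual tokens with action tokens. The codebook size, bin count, value range, action names, and vocabulary layout are all hypothetical values chosen for illustration, not MineWorld's actual configuration.

```python
# Hypothetical sizes; MineWorld's real vocabulary layout may differ.
VISUAL_VOCAB = 8192      # assumed codebook size of the visual tokenizer
CAMERA_BINS = 11         # assumed number of bins per camera axis
CAMERA_RANGE = 90.0      # assumed maximum absolute camera delta, in degrees
DISCRETE_ACTIONS = ["forward", "back", "left", "right", "jump", "attack"]


def quantize_camera(delta):
    """Map a continuous camera delta to a bin index in [0, CAMERA_BINS - 1]."""
    clipped = max(-CAMERA_RANGE, min(CAMERA_RANGE, delta))
    frac = (clipped + CAMERA_RANGE) / (2 * CAMERA_RANGE)  # scaled to [0, 1]
    return min(CAMERA_BINS - 1, int(frac * CAMERA_BINS))


def tokenize_action(pitch, yaw, pressed):
    """Turn one player action into token IDs placed after the visual vocabulary."""
    tokens = [
        VISUAL_VOCAB + quantize_camera(pitch),
        VISUAL_VOCAB + CAMERA_BINS + quantize_camera(yaw),
    ]
    key_base = VISUAL_VOCAB + 2 * CAMERA_BINS
    for i, name in enumerate(DISCRETE_ACTIONS):
        # Two tokens per key: one for "released", one for "pressed".
        tokens.append(key_base + 2 * i + (1 if name in pressed else 0))
    return tokens


def interleave(frames, actions):
    """Build one sequence: frame tokens, action tokens, frame tokens, ..."""
    sequence = []
    for frame_tokens, (pitch, yaw, pressed) in zip(frames, actions):
        sequence.extend(frame_tokens)
        sequence.extend(tokenize_action(pitch, yaw, pressed))
    return sequence
```

Keeping visual and action tokens in disjoint ID ranges lets a single decoder vocabulary cover both modalities, which is one simple way to feed an interleaved sequence to a LLaMA-style model.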
Parallel Decoding Algorithm: To enable real-time interaction, MineWorld introduces a parallel decoding algorithm that exploits the spatial dependencies between adjacent visual tokens, predicting spatially redundant tokens within each frame simultaneously. Compared with conventional autoregressive decoding, which emits one token per step, this approach significantly increases generation speed, allowing models of different scales to generate 4 to 7 frames per second.
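To make the source of the speed-up concrete, the sketch below compares the number of sequential steps needed for one frame under plain raster-order decoding with a schedule that predicts all tokens on the same anti-diagonal of the token grid in one step. The diagonal grouping is an assumption used for illustration (the description above only says that spatially redundant tokens are predicted simultaneously), and the grid size is hypothetical.

```python
def raster_steps(height, width):
    """Sequential steps when each token is decoded one at a time."""
    return height * width


def diagonal_schedule(height, width):
    """Group grid positions by anti-diagonal index r + c.

    Every position's left and upper neighbours fall in an earlier group,
    so all positions within a group can be predicted in the same step.
    """
    steps = [[] for _ in range(height + width - 1)]
    for r in range(height):
        for c in range(width):
            steps[r + c].append((r, c))
    return steps


H, W = 14, 24  # hypothetical token grid for a single frame
schedule = diagonal_schedule(H, W)
print(raster_steps(H, W), len(schedule))  # prints: 336 37
```

Under these assumptions a frame needs 37 sequential model calls instead of 336. The real speed-up depends on how well the hardware batches the tokens within each step.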
Training: The model is trained through next-token prediction, learning the dynamic evolution patterns between game states and the associations between actions and states.
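The next-token objective can be sketched in a few lines: at each position the model outputs scores over the vocabulary, and the loss is the average negative log-probability assigned to the token that actually comes next. This is a generic, pure-Python illustration of the objective, not MineWorld's training code; the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_position, tokens):
    """Average cross-entropy where position t is scored against tokens[t + 1]."""
    total = 0.0
    count = 0
    for t in range(len(tokens) - 1):
        log_probs = log_softmax(logits_per_position[t])
        total -= log_probs[tokens[t + 1]]
        count += 1
    return total / count
```

Because visual and action tokens share one interleaved sequence, the same loss covers both predicting the next frame from past frames and actions and predicting the next action from past frames.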
Inference: During inference, the model generates subsequent game scenes based on the current game state and actions. The application of the parallel decoding algorithm enables the model to quickly generate high-quality game frames.
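The inference loop can be outlined as follows: append the user's action tokens to the context, ask the model for the next frame's tokens, append those, and repeat. The `predict_frame` callable is a hypothetical stand-in for the trained decoder plus its decoding algorithm, and a toy function takes its place here so the loop can be run.

```python
def generate_frames(predict_frame, first_frame, actions):
    """Roll the world model forward one frame per user action.

    predict_frame: hypothetical callable mapping the token context so far
                   to the token list of the next frame.
    first_frame:   token list of the initial frame.
    actions:       one token list per step, supplied by the user.
    """
    context = list(first_frame)
    frames = []
    for action_tokens in actions:
        context.extend(action_tokens)   # condition on the new action
        frame = predict_frame(context)  # model generates the next frame
        context.extend(frame)           # generated frame becomes history
        frames.append(frame)
    return frames


def toy_predict_frame(context):
    """Toy stand-in for the model: derives two tokens from the last one."""
    return [context[-1] + 1, context[-1] + 2]


print(generate_frames(toy_predict_frame, [0, 0], [[5], [7]]))
# prints: [[6, 7], [8, 9]]
```

A real deployment would also need to cap the context length, for example by dropping the oldest frames, which this sketch leaves out.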
Evaluation Metrics: MineWorld introduces new evaluation metrics to assess the visual quality of generated scenes and action-following capability. For example, by comparing the predicted actions in the generated scenes with the actual input actions, the model’s controllability can be quantified.
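A minimal sketch of such a controllability score is shown below, assuming that some separate classifier has already recovered an action label from each generated transition. That classifier, the label names, and the choice of accuracy and F1 as scores are assumptions for illustration rather than the exact metrics defined by MineWorld.

```python
def accuracy(predicted, actual):
    """Fraction of steps where the recovered action equals the input action."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def f1_for_action(predicted, actual, action):
    """F1 score for one action class, treating it as the positive label."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == action and a == action)
    fp = sum(1 for p, a in pairs if p == action and a != action)
    fn = sum(1 for p, a in pairs if p != action and a == action)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


actual = ["forward", "jump", "forward", "attack"]      # actions fed to the model
predicted = ["forward", "forward", "forward", "attack"]  # actions recovered from output
print(accuracy(predicted, actual))  # prints: 0.75
```

Per-class F1 is useful alongside accuracy because rare actions such as attacking can be ignored entirely by a model that still scores well on overall accuracy.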

Project Links for MineWorld

GitHub Repository: https://github.com/microsoft/MineWorld
HuggingFace Model Hub: https://huggingface.co/microsoft/mineworld
arXiv Technical Paper: https://arxiv.org/pdf/2504.08388

Application Scenarios of MineWorld

Embodied Intelligence Research: MineWorld provides a high-fidelity, interactive virtual environment capable of simulating complex physical rules and dynamic scenes, making it highly suitable for embodied intelligence research. Researchers can train agents using the model to learn task execution in virtual environments, such as object localization, navigation, and environment exploration.
Reinforcement Learning Training: The real-time interaction capability and generation quality of MineWorld make it a promising platform for reinforcement learning. Researchers can use the model to generate large amounts of training data quickly, helping agents learn effective strategies in simulated environments.
Game Agent Development: Since MineWorld simultaneously predicts game states and actions during training, it has the potential to function as a game agent. Given an initial game state and actions, the model can iteratively generate future states and actions, simulating long-term gameplay processes.
Real-Time Interactive Simulation: MineWorld’s fast inference speed (4 to 7 frames per second) supports real-time interaction with game players.
Video Generation and Editing: MineWorld can generate high-quality, coherent game videos for content creation, such as game trailers and tutorial videos.