Multiverse – The world’s first AI-generated multiplayer game model, introduced by Enigma Labs
What is Multiverse?
Multiverse is the world’s first AI-generated multiplayer game model, launched by the Israeli team Enigma Labs. It is a multiplayer racing game in which players can overtake, drift, and accelerate, with every action dynamically reshaping the game world in real time. The model uses AI to generate synchronized, coherent visuals for all players, ensuring a logically consistent game world.
Multiverse is built on a diffusion model that merges player perspectives and actions to produce seamless and consistent gameplay. Its core innovation lies in a novel multiplayer world model architecture, which leverages joint action vectors and dual-view channel stacking to solve the problem of visual consistency in multiplayer environments. Remarkably, the model costs only $1,500 to train and can run on a standard PC. The project’s code, data, weights, architecture, and research are fully open-source, opening up new possibilities for AI in multiplayer game development.
Key Features of Multiverse
- Real-time Multiplayer Interaction: Supports two players interacting within the same virtual world in real time; actions like overtaking and collisions are consistently rendered from both perspectives.
- Dynamic World Generation: The game visuals are generated in real time based on the players’ inputs and actions.
- Efficient Frame Prediction: Accurately predicts future game frames to ensure smooth and coherent gameplay.
- Low-Cost Operation: Runs on regular personal computers, eliminating the need for high-end hardware and lowering entry barriers.
Technical Principles of Multiverse
Multiplayer Game Architecture
To build a multiplayer world model, Multiverse restructures core modules from the ground up and redesigns the training pipeline to enable true cooperative gameplay (a minimal code sketch of these modules follows the list below):
- Action Embedder: Captures the actions of both players and outputs a joint embedding.
- Denoising Network: A diffusion network that simultaneously generates both players’ frames based on previous frames and the action embedding.
- Upsampler: Receives and upsamples the frames of both players independently.
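To make the division of labor concrete, here is a minimal PyTorch sketch of the three modules described above. The class names, tensor shapes, and layer choices are illustrative assumptions made for this article, not the actual Enigma Labs implementation (which is open source at the links further below).

```python
import torch
import torch.nn as nn

class ActionEmbedder(nn.Module):
    """Maps both players' control vectors to one joint action embedding."""
    def __init__(self, action_dim=3, embed_dim=128):
        super().__init__()
        self.proj = nn.Linear(2 * action_dim, embed_dim)  # concatenated controls of both players

    def forward(self, actions_p1, actions_p2):
        # actions_*: (batch, action_dim), e.g. steering, throttle, brake
        return self.proj(torch.cat([actions_p1, actions_p2], dim=-1))

class DenoisingNetwork(nn.Module):
    """Stand-in for the diffusion U-Net that denoises both players' next frames
    at once, conditioned on the joint action embedding. The two views arrive
    channel-stacked (see the next section), hence 6 input channels for two RGB frames."""
    def __init__(self, in_ch=6, embed_dim=128):
        super().__init__()
        self.cond = nn.Linear(embed_dim, in_ch)
        self.net = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)  # placeholder for a full U-Net

    def forward(self, noisy_frames, action_embed):
        # noisy_frames: (batch, 6, H, W) -- both views stacked along channels
        cond = self.cond(action_embed)[:, :, None, None]
        return self.net(noisy_frames + cond)

class Upsampler(nn.Module):
    """Upsamples each player's predicted frame independently."""
    def __init__(self, scale=4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)

    def forward(self, frame_p1, frame_p2):
        return self.up(frame_p1), self.up(frame_p2)

# One prediction step: embed both players' actions, denoise the stacked frames,
# then upsample each view separately.
embedder, denoiser, upsampler = ActionEmbedder(), DenoisingNetwork(), Upsampler()
a1, a2 = torch.rand(1, 3), torch.rand(1, 3)
stacked = torch.rand(1, 6, 48, 64)
out = denoiser(stacked, embedder(a1, a2))
hi1, hi2 = upsampler(out[:, :3], out[:, 3:])
```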
Perspective Fusion Solution
To ensure visual consistency between players, the model gathers past frames and actions from both perspectives and outputs predicted frames. The outputs must be not only visually plausible but also logically coherent. Multiverse introduces an innovative workaround:
- Player views are concatenated into a single image.
- Player inputs are fused into a joint action vector.
- The two frames are stacked along the channel axis and treated as a single unified scene, effectively creating an image with doubled color channels.
- Because the diffusion model is a U-Net composed primarily of convolution and deconvolution layers, channel stacking ensures that both views are processed together at every layer, unlike vertical stacking, which delays cross-view interaction (see the sketch below).
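The channel-stacking trick is simple to express in tensor code. Here is a small sketch; the shapes and dummy values are illustrative, and the real model operates on downsampled or latent frames rather than raw RGB.

```python
import torch

B, C, H, W = 1, 3, 48, 64          # batch, RGB channels, frame height/width

frame_p1 = torch.rand(B, C, H, W)  # player 1's view
frame_p2 = torch.rand(B, C, H, W)  # player 2's view

# Stack the two views along the channel axis: one "image" with 6 channels.
# Every convolution now sees both views at the same spatial location,
# so cross-view information mixes at every layer of the U-Net.
stacked = torch.cat([frame_p1, frame_p2], dim=1)   # (B, 6, H, W)

# Fuse both players' controls into a single joint action vector.
actions_p1 = torch.tensor([[0.2, 0.9, 0.0]])       # steering, throttle, brake
actions_p2 = torch.tensor([[-0.1, 0.7, 0.0]])
joint_action = torch.cat([actions_p1, actions_p2], dim=-1)  # (B, 6)

# Contrast: stacking views vertically keeps them spatially far apart, so the
# early conv layers (with small receptive fields) cannot relate them yet.
vertical = torch.cat([frame_p1, frame_p2], dim=2)  # (B, 3, 2*H, W)
```

Because each small convolution kernel now sees the matching pixels of both views in one receptive field, cross-view constraints (such as the other car’s position) can influence the prediction from the very first layer.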
Training Strategies
- Context Extension: To accurately predict the next frame, the model needs recent frames and player actions (such as steering input). Eight frames (at 30 fps) were found to be enough to model basic vehicle dynamics. However, capturing the relative movement between cars, which is much slower than the movement of the road, requires increasing the context size by about 3×, which slows training and increases memory use. To balance efficiency and temporal depth, the model uses sparse frame sampling (see the sampling sketch after this list):
  - It inputs the latest 4 frames,
  - then four more frames sampled at every 4th step,
  - so the earliest frame in context is from 20 frames ago (about 0.67 seconds), enough to capture relative motion.
- Multiplayer Training: Training for multiplayer driving requires longer prediction horizons than single-agent tasks; basic driving models typically forecast only about 0.25 seconds ahead, which is insufficient for player interaction. Multiverse trains with autoregressive prediction up to 15 seconds into the future (at 30 fps), using curriculum learning to gradually increase the prediction length from 0.25 to 15 seconds. Early stages focus on low-level features (car geometry, the track), followed by higher-level behaviors (player interaction). This greatly improves physical and temporal consistency.
- Efficient Long-Horizon Training: Predicting over 100 future frames strains GPU VRAM. To address this, Multiverse uses a paging strategy (see the training-loop sketch after this list):
  - Each training batch loads a portion of the sequence.
  - As new data is loaded, frames that fall outside the context window are discarded, enabling long-horizon prediction without exhausting memory.
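A minimal Python sketch of the sparse context schedule described under Context Extension. The exact offsets (the latest 4 frames plus every 4th frame back to 20 frames ago) are my reading of the description above, not code from the repository.

```python
FPS = 30

def sparse_context_indices(t: int) -> list[int]:
    """Frame indices used as context when predicting frame t."""
    recent = [t - k for k in range(1, 5)]      # t-1 ... t-4: dense recent history
    sparse = [t - k for k in range(8, 21, 4)]  # t-8, t-12, t-16, t-20: every 4th frame
    return recent + sparse                     # 8 frames spanning 20/FPS ≈ 0.67 s

print(sparse_context_indices(100))
# -> [99, 98, 97, 96, 92, 88, 84, 80]
```

The point is that the context still holds only 8 frames (the same memory cost as a dense 8-frame window) while reaching 20 frames into the past.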
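And a sketch of how the curriculum schedule and the paging strategy for long rollouts could fit together in a training loop. The stage count, page size, and function names are illustrative assumptions, not the authors’ code.

```python
FPS = 30
CONTEXT_LEN = 8      # frames kept resident as conditioning
PAGE_LEN = 64        # frames loaded per chunk of the rollout

def curriculum_horizons(stages: int = 6, start_s: float = 0.25, end_s: float = 15.0) -> list[int]:
    """Prediction horizons (in frames) growing from 0.25 s to 15 s."""
    step = (end_s - start_s) / (stages - 1)
    return [round((start_s + i * step) * FPS) for i in range(stages)]

def paged_rollout(horizon: int) -> None:
    """Autoregressively roll out `horizon` frames while keeping only the last
    CONTEXT_LEN frames resident, so memory use stays flat regardless of horizon."""
    window = []                                               # stand-in for cached frame tensors
    for start in range(0, horizon, PAGE_LEN):
        page = range(start, min(start + PAGE_LEN, horizon))   # load the next chunk of the sequence
        for frame_idx in page:
            # ...predict frame `frame_idx` from `window` here...
            window.append(frame_idx)
            window = window[-CONTEXT_LEN:]                    # drop frames outside the context

for horizon in curriculum_horizons():   # e.g. from roughly 8 frames up to 450
    paged_rollout(horizon)
```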
Multiverse Project Links
- Project Website: https://enigma-labs.io/blog
- GitHub Repository: https://github.com/EnigmaLabsAI/multiverse
- HuggingFace Model Hub: https://huggingface.co/Enigma-AI
Multiverse Dataset
- Data Source: The training data is collected from Sony’s racing game Gran Turismo 4.
- Data Collection Method (a toy extraction example follows this list):
  - Each race is replayed twice, once from each player’s perspective, using the game’s replay system.
  - The replays are synchronized and merged so that both players are shown simultaneously.
  - Computer vision is used to read the on-screen UI elements (throttle, brake, steering) frame by frame and reverse-engineer them into control inputs, so no game logs are needed.
- Automated Data Generation:
  - Scripts send randomized inputs to the game’s B-Spec mode.
  - The resulting races are recorded from two angles, capturing AI-driven third-person gameplay for training.
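For intuition, here is a toy illustration of the frame-by-frame HUD extraction idea: crop the fixed screen regions where Gran Turismo 4 draws its gauges and convert them into approximate control values. The crop coordinates and the fill-ratio heuristic are hypothetical placeholders; the actual pipeline is more involved.

```python
import numpy as np

# Hypothetical HUD regions as (y0, y1, x0, x1) pixel coordinates; the real
# coordinates depend on the capture resolution and the game's HUD layout.
HUD_REGIONS = {
    "throttle": (670, 690, 1100, 1180),
    "brake":    (695, 715, 1100, 1180),
}

def gauge_fill_ratio(frame: np.ndarray, region: tuple[int, int, int, int],
                     threshold: int = 128) -> float:
    """Estimate how full a bar-style gauge is from the fraction of bright pixels."""
    y0, y1, x0, x1 = region
    crop = frame[y0:y1, x0:x1].mean(axis=-1)   # grayscale crop of the gauge
    return float((crop > threshold).mean())

def extract_controls(frame: np.ndarray) -> dict[str, float]:
    """Recover approximate control inputs for a single video frame."""
    return {name: gauge_fill_ratio(frame, region) for name, region in HUD_REGIONS.items()}

# Example on a dummy 720p RGB frame (all gauges read as empty):
dummy_frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(extract_controls(dummy_frame))   # {'throttle': 0.0, 'brake': 0.0}
```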
Application Scenarios of Multiverse
- Multiplayer Game Development: Enhance real-time interaction in online games with realistic and responsive world models.
- VR/AR Applications: Build shared virtual environments to boost immersion and social interaction.
- AI Training & Research: Use as an open-source platform to train intelligent agents and study complex decision-making and collaboration.
- Education & Training: Create virtual training environments for driving, military simulations, or teamwork exercises.
- Entertainment & Socializing: Enable innovative social experiences such as virtual parties and online events.
- Simulation & Management Games: Support resource management, city-building, and strategic decision-making, where each action dynamically alters the game world’s economy and ecology.