Matrix-Game: Interactive World Foundation Model
- URL: http://arxiv.org/abs/2506.18701v1
- Date: Mon, 23 Jun 2025 14:40:49 GMT
- Title: Matrix-Game: Interactive World Foundation Model
- Authors: Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou,
- Abstract summary: Matrix-Game is an interactive world foundation model for controllable game world generation. The model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements.
- Score: 11.144250200432458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.
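The abstract describes a controllable image-to-world paradigm: generation is conditioned on a reference image, recent motion context, and per-frame keyboard/mouse actions. Below is a minimal sketch of what such an action-conditioned rollout interface might look like; all names (`Action`, `WorldRequest`, `roll_out`) are hypothetical illustrations, not taken from the released Matrix-Game code.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the image-to-world conditioning described in the
# abstract; class and field names are illustrative only.

@dataclass
class Action:
    keys: frozenset          # keyboard keys held this frame, e.g. {"w", "space"}
    mouse_dx: float = 0.0    # camera yaw delta from mouse movement
    mouse_dy: float = 0.0    # camera pitch delta from mouse movement

@dataclass
class WorldRequest:
    reference_image: object                               # frame that anchors the scene
    motion_context: list = field(default_factory=list)    # recent frames for temporal coherence
    actions: list = field(default_factory=list)           # one user action per generated frame

def roll_out(model, request: WorldRequest):
    """Autoregressively extend the world, one action-conditioned frame at a time."""
    frames = list(request.motion_context)
    for action in request.actions:
        # each step sees the reference image, all frames so far, and the next action
        frame = model(request.reference_image, frames, action)
        frames.append(frame)
    return frames
```

The key design point the abstract implies is that actions condition generation frame by frame, while the motion context carries temporal coherence across the rollout.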
Related papers
- Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition [18.789597877579986]
Hunyuan-GameCraft is a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space. We propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information.
arXiv Detail & Related papers (2025-06-20T17:50:37Z)
- PlayerOne: Egocentric World Simulator [73.88786358213694]
PlayerOne is the first egocentric realistic world simulator. It generates egocentric videos that are strictly aligned with the real-scene human motion of the user, captured by an exocentric camera.
arXiv Detail & Related papers (2025-06-11T17:59:53Z)
- VideoGameBench: Can Vision-Language Models complete popular video games? [8.5302862604852]
Video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases. We introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. We show that frontier vision-language models struggle to progress beyond the beginning of each game.
arXiv Detail & Related papers (2025-05-23T17:43:27Z)
- MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft [21.530000271719803]
We propose MineWorld, a real-time interactive world model on Minecraft. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input. We develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time.
arXiv Detail & Related papers (2025-04-11T09:41:04Z)
- GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control [122.65089441381741]
We present GEM, a Generalizable Ego-vision Multimodal world model. It predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Our dataset comprises 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights.
arXiv Detail & Related papers (2024-12-15T14:21:19Z)
- From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos [71.22810401256234]
Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world. Large-scale synthetic and object-centric 3D datasets have been shown to be effective in training models with 3D understanding of objects. We introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale.
arXiv Detail & Related papers (2024-12-10T18:59:44Z)
- The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control [16.075784652681172]
We present The Matrix, the first foundational realistic world simulator capable of generating continuous 720p real-scene video streams. The Matrix allows users to traverse diverse terrains in continuous, uncut hour-long sequences. The Matrix can simulate a BMW X3 driving through an office setting, an environment present in neither gaming data nor real-world sources.
arXiv Detail & Related papers (2024-12-04T18:59:05Z)
- GameGen-X: Interactive Open-world Game Video Generation [10.001128258269675]
We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. It simulates an array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. It provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation.
arXiv Detail & Related papers (2024-11-01T17:59:17Z)
- Learning Interactive Real-World Simulators [96.5991333400566]
We explore the possibility of learning a universal simulator of real-world interaction through generative modeling.
We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies.
Video captioning models can benefit from training with simulated experience, opening up even wider applications.
arXiv Detail & Related papers (2023-10-09T19:42:22Z)
- UniCon: Universal Neural Controller For Physics-based Character Motion [70.45421551688332]
We propose a physics-based universal neural controller (UniCon) that learns to master thousands of motions with different styles by learning on large-scale motion datasets.
UniCon can support keyboard-driven control, compose motion sequences drawn from a large pool of locomotion and acrobatics skills and teleport a person captured on video to a physics-based virtual avatar.
arXiv Detail & Related papers (2020-11-30T18:51:16Z)
- Mastering Atari with Discrete World Models [61.7688353335468]
We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model.
DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model.
arXiv Detail & Related papers (2020-10-05T17:52:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.