Matrix-Game: Interactive World Foundation Model
- URL: http://arxiv.org/abs/2506.18701v1
- Date: Mon, 23 Jun 2025 14:40:49 GMT
- Title: Matrix-Game: Interactive World Foundation Model
- Authors: Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou,
- Abstract summary: Matrix-Game is an interactive world foundation model for controllable game world generation. The model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements.
- Score: 11.144250200432458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.
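The abstract describes a controllable image-to-world paradigm: generation is conditioned on a reference image, recent motion context, and per-frame keyboard/mouse actions. Below is a minimal sketch of what such an action-conditioned rollout interface might look like; all names (`Action`, `WorldRequest`, `roll_out`) are hypothetical illustrations, not taken from the released Matrix-Game code.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the image-to-world conditioning described in the
# abstract; class and field names are illustrative only.

@dataclass
class Action:
    keys: frozenset          # keyboard keys held this frame, e.g. {"w", "space"}
    mouse_dx: float = 0.0    # camera yaw delta from mouse movement
    mouse_dy: float = 0.0    # camera pitch delta from mouse movement

@dataclass
class WorldRequest:
    reference_image: object                               # frame that anchors the scene
    motion_context: list = field(default_factory=list)    # recent frames for temporal coherence
    actions: list = field(default_factory=list)           # one user action per generated frame

def roll_out(model, request: WorldRequest):
    """Autoregressively extend the world, one action-conditioned frame at a time."""
    frames = list(request.motion_context)
    for action in request.actions:
        # each step sees the reference image, all frames so far, and the next action
        frame = model(request.reference_image, frames, action)
        frames.append(frame)
    return frames
```

The key design point the abstract implies is that actions condition generation frame by frame, while the motion context carries temporal coherence across the rollout.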
Related papers
- Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition [18.789597877579986]
Hunyuan-GameCraft is a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space. We propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information.
arXiv Detail & Related papers (2025-06-20T17:50:37Z)
- PlayerOne: Egocentric World Simulator [73.88786358213694]
PlayerOne is the first egocentric realistic world simulator. It generates egocentric videos that are strictly aligned with the real-scene human motion of the user, captured by an exocentric camera.
arXiv Detail & Related papers (2025-06-11T17:59:53Z)
- VideoGameBench: Can Vision-Language Models complete popular video games? [8.5302862604852]
Video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases. We introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. We show that frontier vision-language models struggle to progress beyond the beginning of each game.
arXiv Detail & Related papers (2025-05-23T17:43:27Z)
- MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft [21.530000271719803]
We propose MineWorld, a real-time interactive world model on Minecraft. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input. We develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time.
arXiv Detail & Related papers (2025-04-11T09:41:04Z)
- GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control [122.65089441381741]
We present GEM, a Generalizable Ego-vision Multimodal world model. It predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Our dataset comprises 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights.
arXiv Detail & Related papers (2024-12-15T14:21:19Z)
- From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos [71.22810401256234]
Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world. Large-scale synthetic and object-centric 3D datasets have been shown to be effective in training models with 3D understanding of objects. We introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale.
arXiv Detail & Related papers (2024-12-10T18:59:44Z)
- The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control [16.075784652681172]
We present The Matrix, the first foundational realistic world simulator capable of generating continuous 720p real-scene video streams. The Matrix allows users to traverse diverse terrains in continuous, uncut hour-long sequences. The Matrix can simulate a BMW X3 driving through an office setting, an environment present in neither gaming data nor real-world sources.
arXiv Detail & Related papers (2024-12-04T18:59:05Z)
- GameGen-X: Interactive Open-world Game Video Generation [10.001128258269675]
We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. It simulates an array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. It provides interactive controllability, predicting and altering future content based on the current clip, thus allowing for gameplay simulation.
arXiv Detail & Related papers (2024-11-01T17:59:17Z)
- Learning Interactive Real-World Simulators [96.5991333400566]
We explore the possibility of learning a universal simulator of real-world interaction through generative modeling.
We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies.
Video captioning models can benefit from training with simulated experience, opening up even wider applications.
arXiv Detail & Related papers (2023-10-09T19:42:22Z)
- UniCon: Universal Neural Controller For Physics-based Character Motion [70.45421551688332]
We propose a physics-based universal neural controller (UniCon) that learns to master thousands of motions with different styles by learning on large-scale motion datasets.
UniCon can support keyboard-driven control, compose motion sequences drawn from a large pool of locomotion and acrobatics skills and teleport a person captured on video to a physics-based virtual avatar.
arXiv Detail & Related papers (2020-11-30T18:51:16Z)
- Mastering Atari with Discrete World Models [61.7688353335468]
We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model.
DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model.
arXiv Detail & Related papers (2020-10-05T17:52:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.