Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
- URL: http://arxiv.org/abs/2508.13009v1
- Date: Mon, 18 Aug 2025 15:28:53 GMT
- Title: Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
- Authors: Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Cyrus Wu, Wei Li, Xuchen Song, Yang Liu, Eric Li, Yahui Zhou
- Abstract summary: Matrix-Game 2.0 is an interactive world model that generates long videos on-the-fly via few-step auto-regressive diffusion. It can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS.
- Score: 15.16063778402193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in interactive video generation have demonstrated diffusion models' potential as world models by capturing complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on-the-fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) a scalable data production pipeline for Unreal Engine and GTA5 environments that efficiently produces massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) an action injection module that enables frame-level mouse and keyboard inputs as interactive conditions; (3) few-step distillation based on the causal architecture for real-time and streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.
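Components (2) and (3) of the abstract can be pictured concretely: per-frame keyboard/mouse signals are embedded and injected into the frame latents, and each new frame is produced by a handful of denoising steps under a causal (past-only) attention mask, so frames stream out one at a time. The PyTorch sketch below illustrates only this shape; it is not the released Matrix-Game 2.0 code, and every module name, dimension, and the 4-step schedule are illustrative assumptions.

```python
# Minimal sketch (NOT the released Matrix-Game 2.0 code) of two ideas from the
# abstract: frame-level action injection and few-step causal denoising for
# streaming generation. Names, dims, and the 4-step schedule are assumptions;
# the real model is a distilled video diffusion transformer over VAE latents.
import torch
import torch.nn as nn

class ActionInjection(nn.Module):
    """Embeds per-frame keyboard/mouse input and adds it to frame latents."""
    def __init__(self, n_keys: int = 8, dim: int = 256):
        super().__init__()
        self.key_proj = nn.Linear(n_keys, dim)   # multi-hot key state per frame
        self.mouse_proj = nn.Linear(2, dim)      # (dx, dy) mouse deltas per frame

    def forward(self, latents, keys, mouse):
        # latents: (B, T, dim), keys: (B, T, n_keys), mouse: (B, T, 2)
        return latents + self.key_proj(keys) + self.mouse_proj(mouse)

class CausalDenoiser(nn.Module):
    """Transformer whose causal mask lets each frame attend only to the past."""
    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        T = x.size(1)  # -inf above the diagonal blocks attention to future frames
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.out(self.encoder(x, mask=mask))

@torch.no_grad()
def stream_frames(denoiser, inject, n_frames=8, n_steps=4, dim=256):
    """Yields frame latents one by one; each costs only n_steps network calls."""
    history = torch.zeros(1, 0, dim)  # committed past frames (frozen)
    for _ in range(n_frames):
        keys = torch.zeros(1, 1, 8)   # placeholder for the live keyboard state
        mouse = torch.zeros(1, 1, 2)  # placeholder for the live mouse deltas
        frame = torch.randn(1, 1, dim)                # start the frame from noise
        for _ in range(n_steps):                      # few-step refinement
            ctx = torch.cat([history, inject(frame, keys, mouse)], dim=1)
            frame = denoiser(ctx)[:, -1:, :]          # refine only the new frame
        history = torch.cat([history, frame], dim=1)  # commit; never revisited
        yield frame

# Usage: iterate and (in a real system) decode each latent to pixels.
model, act = CausalDenoiser(), ActionInjection()
for f in stream_frames(model, act):
    pass  # f has shape (1, 1, 256)
```

In the actual system, distillation is what makes the per-frame step count this small, and key/value caching over the frozen history is what keeps per-frame cost roughly constant; both are omitted here for brevity.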
Related papers
- TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model [53.555353366322464]
We present TeleWorld, a real-time multimodal 4D world modeling framework that unifies video generation, dynamic scene reconstruction, and long-term world memory within a closed-loop system. Our approach achieves seamless integration of dynamic object modeling and static scene representation within a unified 4D framework, advancing world models toward practical, interactive, and computationally accessible synthesis systems.
arXiv Detail & Related papers (2025-12-31T18:31:46Z)
- UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework. We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z)
- SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation [64.3409486422946]
We present SpriteHand, an autoregressive video generation framework for real-time synthesis of hand-object interaction videos. Our model employs a causal inference architecture for autoregressive generation and leverages a hybrid post-training approach to enhance visual realism and temporal coherence. Experiments demonstrate superior visual quality, physical plausibility, and interaction fidelity compared to both generative and engine-based baselines.
arXiv Detail & Related papers (2025-12-01T18:13:40Z)
- Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model [19.937724706042804]
Hunyuan-GameCraft-2 is a new paradigm of instruction-driven interaction for generative game world modeling. It allows users to control game video content through natural language prompts, keyboard, or mouse signals, and generates temporally coherent and causally grounded interactive game videos.
arXiv Detail & Related papers (2025-11-28T18:26:39Z)
- Simulating the Visual World with Artificial Intelligence: A Roadmap [48.64639618440864]
Video generation is shifting from generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components. We trace the progression of video generation through four generations, culminating in a video generation model that embodies intrinsic physical plausibility.
arXiv Detail & Related papers (2025-11-11T18:59:50Z)
- Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition [18.789597877579986]
Hunyuan-GameCraft is a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space (a toy sketch of this input-to-camera mapping appears after this list). We propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information.
arXiv Detail & Related papers (2025-06-20T17:50:37Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing of video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- Vid2World: Crafting Video Diffusion Models to Interactive World Models [38.270098691244314]
Vid2World is a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. It performs causalization of a pre-trained video diffusion model by crafting its architecture and training objective to enable autoregressive generation. It introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model.
arXiv Detail & Related papers (2025-05-20T13:41:45Z)
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be versatilely applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- InterDyn: Controllable Interactive Dynamics with Video Diffusion Models [50.38647583839384]
We propose InterDyn, a framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics simulators, having learned interactive dynamics from large-scale video data.
arXiv Detail & Related papers (2024-12-16T13:57:02Z)
- Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video [23.484070818399]
Video2Game is a novel approach that automatically converts videos of real-world scenes into realistic and interactive game environments.
We show that we can not only produce highly-realistic renderings in real-time, but also build interactive games on top.
arXiv Detail & Related papers (2024-04-15T14:32:32Z)
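As an aside on the control representation mentioned in the Hunyuan-GameCraft entry above (unifying keyboard and mouse inputs into a shared camera representation space), a toy sketch of what such a mapping could look like follows. The key bindings, step sizes, and CameraDelta fields are assumptions for illustration, not that paper's implementation.

```python
# Hypothetical sketch of unifying keyboard/mouse input into one camera-motion
# representation, per the high-level description in the Hunyuan-GameCraft
# entry. Key bindings and step sizes are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CameraDelta:
    dx: float = 0.0     # right(+)/left(-) translation
    dz: float = 0.0     # forward(+)/back(-) translation
    yaw: float = 0.0    # horizontal rotation, radians
    pitch: float = 0.0  # vertical rotation, radians

MOVE_STEP, TURN_GAIN = 0.1, 0.002  # assumed units per frame / per mouse pixel

def to_camera_space(keys: set[str], mouse_dx: float, mouse_dy: float) -> CameraDelta:
    """Maps one frame of raw input to a shared camera-motion representation."""
    d = CameraDelta()
    if "w" in keys: d.dz += MOVE_STEP
    if "s" in keys: d.dz -= MOVE_STEP
    if "a" in keys: d.dx -= MOVE_STEP
    if "d" in keys: d.dx += MOVE_STEP
    d.yaw = mouse_dx * TURN_GAIN     # mouse motion becomes rotation
    d.pitch = -mouse_dy * TURN_GAIN
    return d

# Example: holding W while moving the mouse right yields forward motion plus yaw.
print(to_camera_space({"w"}, mouse_dx=15.0, mouse_dy=0.0))
```

Representing all inputs as camera motion gives the generator one homogeneous conditioning signal regardless of input device, which is the stated motivation for the shared space.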