Solaris: Building a Multiplayer Video World Model in Minecraft
- URL: http://arxiv.org/abs/2602.22208v2
- Date: Thu, 26 Feb 2026 04:21:35 GMT
- Title: Solaris: Building a Multiplayer Video World Model in Minecraft
- Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie
- Abstract summary: Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. We collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency.
- Score: 25.935990718354176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
Related papers
- Simulating the Visual World with Artificial Intelligence: A Roadmap [48.64639618440864]
Video generation is shifting from generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components. We trace the progression of video generation through four generations, culminating in a video generation model that embodies intrinsic physical plausibility.
arXiv Detail & Related papers (2025-11-11T18:59:50Z)
- MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation [20.943599420478105]
We propose a novel multi-modal multi-view video generation approach to autonomous driving. Our approach is capable of generating multi-modal multi-view driving scene videos in a unified framework. Our experiments on the challenging real-world autonomous driving dataset, nuScenes, show that our approach can generate multi-modal multi-view urban scene videos with high fidelity and controllability.
arXiv Detail & Related papers (2025-08-20T00:51:36Z)
- Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model [15.16063778402193]
Matrix-Game 2.0 is an interactive world model that generates long videos on the fly via few-step auto-regressive diffusion. It can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS.
arXiv Detail & Related papers (2025-08-18T15:28:53Z)
- PlayerOne: Egocentric World Simulator [73.88786358213694]
PlayerOne is the first egocentric realistic world simulator. It generates egocentric videos that are strictly aligned with the real scene human motion of the user captured by an exocentric camera.
arXiv Detail & Related papers (2025-06-11T17:59:53Z)
- EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z)
- MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft [21.530000271719803]
We propose MineWorld, a real-time interactive world model on Minecraft. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input. We develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time.
arXiv Detail & Related papers (2025-04-11T09:41:04Z)
- Exploration-Driven Generative Interactive Environments [53.05314852577144]
We focus on using many virtual environments for inexpensive, automatically collected interaction data. We propose a training framework merely using a random agent in virtual environments. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments.
arXiv Detail & Related papers (2025-04-03T12:01:41Z)
- GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control [122.65089441381741]
We present GEM, a Generalizable Ego-vision Multimodal world model. It predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Our dataset is comprised of 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights.
arXiv Detail & Related papers (2024-12-15T14:21:19Z)
- Multi-Game Decision Transformers [49.257185338595434]
We show that a single transformer-based model can play a suite of up to 46 Atari games simultaneously at close-to-human performance.
We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning.
We find that our Multi-Game Decision Transformer models offer the best scalability and performance.
arXiv Detail & Related papers (2022-05-30T16:55:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.