MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
- URL: http://arxiv.org/abs/2504.08388v1
- Date: Fri, 11 Apr 2025 09:41:04 GMT
- Title: MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
- Authors: Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian,
- Abstract summary: We propose MineWorld, a real-time interactive world model on Minecraft.<n>MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input.<n>We develop a novel parallel decoding algorithm that predicts the spatial redundant tokens in each frame at the same time.
- Score: 21.530000271719803
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer correspondingly, we consist the model input with the concatenation of the two kinds of ids interleaved. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. In inference, we develop a novel parallel decoding algorithm that predicts the spatial redundant tokens in each frame at the same time, letting models in different scales generate $4$ to $7$ frames per second and enabling real-time interactions with game players. In evaluation, we propose new metrics to assess not only visual quality but also the action following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, outperforming SoTA open-sourced diffusion based world models significantly. The code and model have been released.
Related papers
- Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination [25.62602420895531]
DreMa is a new approach for constructing digital twins using learned explicit representations of the real world and its dynamics.<n>We show that DreMa can successfully learn novel physical tasks from just a single example per task variation.
arXiv Detail & Related papers (2024-12-19T15:38:15Z) - Diffusion for World Modeling: Visual Details Matter in Atari [22.915802013352465]
We introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model.
We analyze the key design choices that are required to make diffusion suitable for world modeling, and demonstrate how improved visual details can lead to improved agent performance.
DIAMOND achieves a mean human normalized score of 1.46 on the competitive Atari 100k benchmark; a new best for agents trained entirely within a world model.
arXiv Detail & Related papers (2024-05-20T22:51:05Z) - Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video [23.484070818399]
Video2Game is a novel approach that automatically converts videos of real-world scenes into realistic and interactive game environments.
We show that we can not only produce highly-realistic renderings in real-time, but also build interactive games on top.
arXiv Detail & Related papers (2024-04-15T14:32:32Z) - WorldDreamer: Towards General World Models for Video Generation via
Predicting Masked Tokens [75.02160668328425]
We introduce WorldDreamer, a pioneering world model to foster a comprehensive comprehension of general world physics and motions.
WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge.
Our experiments show that WorldDreamer excels in generating videos across different scenarios, including natural scenes and driving environments.
arXiv Detail & Related papers (2024-01-18T14:01:20Z) - Slot Structured World Models [0.0]
State-of-the-art approaches use a feedforward encoder to extract object embeddings and a latent graph neural network to model the interaction between these object embeddings.
We introduce it Slot Structured World Models (SSWM), a class of world models that combines an it object-centric encoder with a latent graph-based dynamics model.
arXiv Detail & Related papers (2024-01-08T21:19:30Z) - Learning Interactive Real-World Simulators [96.5991333400566]
We explore the possibility of learning a universal simulator of real-world interaction through generative modeling.
We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies.
Video captioning models can benefit from training with simulated experience, opening up even wider applications.
arXiv Detail & Related papers (2023-10-09T19:42:22Z) - Tachikuma: Understading Complex Interactions with Multi-Character and
Novel Objects by Large Language Models [67.20964015591262]
We introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation task and a supporting dataset.
The dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations.
We present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding.
arXiv Detail & Related papers (2023-07-24T07:40:59Z) - Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion
Models [68.85478477006178]
We present a Promptable Game Model (PGM) for neural video game simulators.
It allows a user to play the game by prompting it with high- and low-level action sequences.
Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt.
Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z) - Infusing Commonsense World Models with Graph Knowledge [89.27044249858332]
We study the setting of generating narratives in an open world text adventure game.
A graph representation of the underlying game state can be used to train models that consume and output both grounded graph representations and natural language descriptions and actions.
arXiv Detail & Related papers (2023-01-13T19:58:27Z) - Mastering Atari with Discrete World Models [61.7688353335468]
We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model.
DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model.
arXiv Detail & Related papers (2020-10-05T17:52:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.