Learning to play: A Multimodal Agent for 3D Game-Play
- URL: http://arxiv.org/abs/2510.16774v1
- Date: Sun, 19 Oct 2025 09:45:15 GMT
- Title: Learning to play: A Multimodal Agent for 3D Game-Play
- Authors: Yuguang Yue, Irakli Salia, Samuel Hunt, Christopher Green, Wenzhe Shi, Jonathan J Hunt
- Abstract summary: We first describe our dataset of human game-play, collected across a large variety of 3-D first-person games. We show the resulting model is capable of playing a variety of 3-D games and responding to text input.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We argue that 3-D first-person video games are a challenging environment for real-time multi-modal reasoning. We first describe our dataset of human game-play, collected across a large variety of 3-D first-person games, which is both substantially larger and more diverse than prior publicly disclosed datasets, and contains text instructions. We demonstrate that we can learn an inverse dynamics model from this dataset, which allows us to impute actions on a much larger dataset of publicly available videos of human game play that lack recorded actions. We then train a text-conditioned agent for game playing using behavior cloning, with a custom architecture capable of real-time inference on a consumer GPU. We show the resulting model is capable of playing a variety of 3-D games and responding to text input. Finally, we outline some of the remaining challenges such as long-horizon tasks and quantitative evaluation across a large set of games.
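The pipeline described in the abstract (train an inverse dynamics model, or IDM, on action-labeled play, then use it to impute pseudo-action labels on unlabeled gameplay video for behavior cloning) can be illustrated with a minimal sketch. Everything below is illustrative only: the linear toy dynamics, shapes, and variable names are assumptions for demonstration and not the paper's actual data or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "environment": linear dynamics  s' = s + B @ a  with 4-dim
# observations and 2-dim continuous actions (purely illustrative).
B = rng.normal(size=(4, 2))

def rollout(n):
    """Generate n (state, action, next_state) transitions."""
    s = rng.normal(size=(n, 4))
    a = rng.normal(size=(n, 2))
    s_next = s + a @ B.T
    return s, a, s_next

# 1) Train a linear IDM on the small action-labeled dataset:
#    predict the action from a pair of consecutive observations.
s, a, s_next = rollout(1000)
X = np.hstack([s, s_next])              # IDM input: concatenated obs pair
W, *_ = np.linalg.lstsq(X, a, rcond=None)

# 2) Impute actions on "unlabeled video": observation pairs whose
#    actions were never recorded. (a_true is kept here only to
#    measure imputation quality; a real video corpus has no actions.)
s_u, a_true, s_next_u = rollout(500)
a_imputed = np.hstack([s_u, s_next_u]) @ W

# The (observation, imputed action) pairs would then feed a
# behavior-cloning objective for the policy.
err = np.abs(a_imputed - a_true).mean()
print(f"mean imputation error: {err:.2e}")
```

Because the toy dynamics are exactly linear, the least-squares IDM recovers the actions almost perfectly; the point is only the data flow, not the model class.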
Related papers
- Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing
We release all data (8300+ hours of high-quality human gameplay), training and inference code, and pretrained checkpoints under an open license. We show that our best model is capable of playing a variety of 3D video games at a level competitive with human play. We first show in a simple toy problem that, for some types of causal reasoning, increasing both the amount of training data and the depth of the network results in the model learning a more causal policy.
arXiv Detail & Related papers (2026-01-08T04:06:17Z)
- NitroGen: An Open Foundation Model for Generalist Gaming Agents
NitroGen is a vision-action foundation model for generalist gaming agents. It is trained on 40,000 hours of gameplay videos across more than 1,000 games.
arXiv Detail & Related papers (2026-01-04T16:24:50Z)
- Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned keyboard-and-mouse inputs. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Experiments show that Game-TARS achieves roughly twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks.
arXiv Detail & Related papers (2025-10-27T17:43:51Z)
- Pixels to Play: A Foundation Model for 3D Gameplay
We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior.
arXiv Detail & Related papers (2025-08-19T22:24:50Z)
- Multimodal 3D Reasoning Segmentation with Complex Scenes
We propose a 3D reasoning segmentation task for scenes containing multiple objects. The task produces 3D segmentation masks together with detailed textual explanations enriched by 3D spatial relations among objects. In addition, we design MORE3D, a novel 3D reasoning network that works with queries over multiple objects.
arXiv Detail & Related papers (2024-11-21T08:22:45Z)
- Diffusion Models are Efficient Data Generators for Human Mesh Recovery
We show that synthetic data created by generative models is complementary to CG-rendered data. We propose an effective data generation pipeline based on recent diffusion models, termed HumanWild. Our work could pave the way for scaling up 3D human recovery to in-the-wild scenes.
arXiv Detail & Related papers (2024-03-17T06:31:16Z)
- Modeling Player Personality Factors from In-Game Behavior and Affective Expression
We explore possibilities to predict a series of player personality questionnaire metrics from recorded in-game behavior.
We predict a wide variety of personality metrics from seven established questionnaires across 62 players over 60 minutes of gameplay of a customized version of the role-playing game Fallout: New Vegas.
arXiv Detail & Related papers (2023-08-27T22:59:08Z)
- Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models
We present a Promptable Game Model (PGM) for neural video game simulators.
It allows a user to play the game by prompting it with high- and low-level action sequences.
Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt.
Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z)
- Playing for 3D Human Recovery
In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths.
Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine.
A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin.
arXiv Detail & Related papers (2021-10-14T17:49:42Z)
- Benchmarking End-to-End Behavioural Cloning on Video Games
We study the general applicability of behavioural cloning on twelve video games, including six modern video games (published after 2010).
Our results show that these agents cannot match humans in raw performance but do learn basic dynamics and rules.
We also demonstrate how the quality of the data matters, and how recording data from humans is subject to a state-action mismatch due to human reflexes.
arXiv Detail & Related papers (2020-04-02T13:31:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.