Pixels to Play: A Foundation Model for 3D Gameplay
- URL: http://arxiv.org/abs/2508.14295v1
- Date: Tue, 19 Aug 2025 22:24:50 GMT
- Title: Pixels to Play: A Foundation Model for 3D Gameplay
- Authors: Yuguang Yue, Chris Green, Samuel Hunt, Irakli Salia, Wenzhe Shi, Jonathan J Hunt
- Abstract summary: We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior.
- Score: 4.380638021267298
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
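The abstract's autoregressive action head can be illustrated with a minimal sketch. Everything below (the toy action space, function names, and scoring) is an assumption for illustration, not the paper's actual implementation: rather than one softmax over the full joint action space, each action component is decoded in turn, conditioned on the components already chosen.

```python
# Hypothetical sketch (toy action space, not the paper's actual model):
# a flat softmax over every (key, mouse_dx, mouse_dy) combination grows
# multiplicatively, while autoregressive decoding factorizes
#   p(a) = p(key) * p(dx | key) * p(dy | key, dx)
# so each head only spans its own small vocabulary.

KEYS = ["none", "W", "A", "S", "D", "space"]
MOUSE_BINS = list(range(-2, 3))  # coarse mouse-delta bins: -2 .. 2

def joint_space_size():
    """Outputs needed for a single softmax over the joint action space."""
    return len(KEYS) * len(MOUSE_BINS) * len(MOUSE_BINS)

def autoregressive_head_sizes():
    """Outputs needed when decoding key, then dx, then dy in sequence."""
    return [len(KEYS), len(MOUSE_BINS), len(MOUSE_BINS)]

def decode_action(logits_fn):
    """Greedy autoregressive decode: each component conditions on the
    components already chosen, via logits_fn(prefix, vocab) -> {action: score}."""
    prefix = []
    for vocab in (KEYS, MOUSE_BINS, MOUSE_BINS):
        scores = logits_fn(tuple(prefix), vocab)
        prefix.append(max(vocab, key=lambda a: scores[a]))
    return tuple(prefix)

def toy_logits(prefix, vocab):
    """Stand-in for a transformer head: always prefers 'W' and zero mouse motion."""
    return {a: (1.0 if a in ("W", 0) else 0.0) for a in vocab}
```

With this toy space a joint head would need 150 outputs, while the three autoregressive heads need only 6 + 5 + 5 = 16, which is why the factorization stays latency-friendly as the action space grows.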
Related papers
- AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games [63.29377274531968]
We introduce the AI GameStore, a scalable and open-ended platform to synthesize new representative human games. We generate 100 such games based on the top charts of the Apple App Store and Steam, and evaluate seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory, and planning.
arXiv Detail & Related papers (2026-02-19T18:17:25Z)
- NitroGen: An Open Foundation Model for Generalist Gaming Agents [101.41866522979548]
NitroGen is a vision-action foundation model for generalist gaming agents. It is trained on 40,000 hours of gameplay videos across more than 1,000 games.
arXiv Detail & Related papers (2026-01-04T16:24:50Z)
- Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents [56.25101378553328]
We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned keyboard-mouse inputs. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Experiments show that Game-TARS achieves about twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks.
arXiv Detail & Related papers (2025-10-27T17:43:51Z)
- Learning to play: A Multimodal Agent for 3D Game-Play [2.5663091969883993]
We first describe our dataset of human game-play, collected across a large variety of 3D first-person games. We show the resulting model is capable of playing a variety of 3D games and responding to text input.
arXiv Detail & Related papers (2025-10-19T09:45:15Z)
- Object-centric 3D Motion Field for Robot Learning from Human Videos [56.9436352861611]
We propose to use an object-centric 3D motion field to represent actions for robot learning from human videos. We present a novel framework for extracting this representation from videos for zero-shot control. Experiments show that our method reduces 3D motion estimation error by over 50% compared to the latest method.
arXiv Detail & Related papers (2025-06-04T17:59:06Z)
- SoccerDiffusion: Toward Learning End-to-End Humanoid Robot Soccer from Gameplay Recordings [2.572390511592254]
SoccerDiffusion is a transformer-based diffusion model that learns end-to-end control policies for humanoid robot soccer. We employ a distillation technique to enable real-time inference on embedded platforms. Our results demonstrate the model's ability to replicate complex motion behaviors in simulation and on physical robots.
arXiv Detail & Related papers (2025-04-29T14:21:08Z)
- SynPlay: Importing Real-world Diversity for a Synthetic Human Dataset [19.32308498024933]
We introduce Synthetic Playground (SynPlay), a new synthetic human dataset that aims to bring out the diversity of human appearance in the real world.
We focus on two factors to achieve a level of diversity that has not yet been seen in previous works: realistic human motions and poses.
We show that using SynPlay in model training leads to enhanced accuracy over existing synthetic datasets for human detection and segmentation.
arXiv Detail & Related papers (2024-08-21T17:58:49Z)
- Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models [68.85478477006178]
We present a Promptable Game Model (PGM) for neural video game simulators.
It allows a user to play the game by prompting it with high- and low-level action sequences.
Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt.
Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z)
- Multi-Game Decision Transformers [49.257185338595434]
We show that a single transformer-based model can play a suite of up to 46 Atari games simultaneously at close-to-human performance.
We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning.
We find that our Multi-Game Decision Transformer models offer the best scalability and performance.
arXiv Detail & Related papers (2022-05-30T16:55:38Z)
- Playing for 3D Human Recovery [88.91567909861442]
In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths.
Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine.
A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin.
arXiv Detail & Related papers (2021-10-14T17:49:42Z)
- Benchmarking End-to-End Behavioural Cloning on Video Games [5.863352129133669]
We study the general applicability of behavioural cloning on twelve video games, including six modern video games (published after 2010).
Our results show that these agents cannot match humans in raw performance but do learn basic dynamics and rules.
We also demonstrate how the quality of the data matters, and how recording data from humans is subject to a state-action mismatch, due to human reflexes.
arXiv Detail & Related papers (2020-04-02T13:31:51Z)
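The state-action mismatch noted in the benchmarking paper above (an action recorded at frame t often reacts to an earlier frame, because of human reaction time) is commonly handled by re-pairing frames with later actions. A minimal sketch, where `delay_frames` is a hypothetical parameter rather than anything specified in that paper:

```python
# Hedged sketch of one common remedy for the reflex-induced state-action
# mismatch: shift recorded actions earlier by an assumed reaction delay
# (delay_frames is an illustrative parameter, not from the paper), so each
# frame is paired with the action the player chose in response to it.

def realign_actions(frames, actions, delay_frames=1):
    """Pair frame t with action t + delay_frames; drop the unmatched tail."""
    assert len(frames) == len(actions)
    if delay_frames == 0:
        return list(zip(frames, actions))
    return list(zip(frames[:-delay_frames], actions[delay_frames:]))
```

For example, with a one-frame delay, frame 0 is paired with the action logged at frame 1, and the final action-less frame is dropped; choosing the delay well requires an estimate of the recording setup's end-to-end latency.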
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.