VideoGameBench: Can Vision-Language Models complete popular video games?
- URL: http://arxiv.org/abs/2505.18134v2
- Date: Fri, 30 May 2025 14:50:48 GMT
- Title: VideoGameBench: Can Vision-Language Models complete popular video games?
- Authors: Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press,
- Abstract summary: Video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases.<n>We introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time.<n>We show that frontier vision-language models struggle to progress beyond the beginning of each game.
- Score: 8.5302862604852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.
Related papers
- AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games [63.29377274531968]
We introduce the AI GameStore, a scalable and open-ended platform to synthesize new representative human games.<n>We generate 100 such games based on the top charts of Apple App Store and Steam, and evaluate seven frontier vision-language models (VLMs) on short episodes of play.<n>The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning.
arXiv Detail & Related papers (2026-02-19T18:17:25Z) - Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos [82.4003989236851]
We propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding.<n>We introduce PhysGame, a dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories.<n>Experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real world physical reasoning performance of Qwen2.5VL by 2.5%, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark.
arXiv Detail & Related papers (2026-01-23T06:02:07Z) - NitroGen: An Open Foundation Model for Generalist Gaming Agents [101.41866522979548]
NitroGen is a vision-action foundation model for generalist gaming agents.<n>It is trained on 40,000 hours of gameplay videos across more than 1,000 games.
arXiv Detail & Related papers (2026-01-04T16:24:50Z) - FAIRGAMER: Evaluating Biases in the Application of Large Language Models to Video Games [9.989488318132539]
We show that Large Language Models' inherent social biases can directly damage game balance in real-world gaming environments.<n>We present FairGamer, the first bias evaluation Benchmark for LLMs in video game scenarios.
arXiv Detail & Related papers (2025-08-25T09:26:19Z) - Matrix-Game: Interactive World Foundation Model [11.144250200432458]
Matrix-Game is an interactive world foundation model for controllable game world generation.<n>Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions.<n>With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements.
arXiv Detail & Related papers (2025-06-23T14:40:49Z) - lmgame-Bench: How Good are LLMs at Playing Games? [60.01834131847881]
We study the major challenges in using popular video games to evaluate modern large language model (LLM) agents.<n>We introduce lmgame-Bench to turn games into reliable evaluations.
arXiv Detail & Related papers (2025-05-21T06:02:55Z) - Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning [89.93384726755106]
Vision-language reinforcement learning (RL) has primarily focused on narrow domains.<n>We find video games inherently provide rich visual elements and mechanics that are easy to verify.<n>To fully use the multimodal and verifiable reward in video games, we propose Game-RL.
arXiv Detail & Related papers (2025-05-20T03:47:44Z) - AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction [58.240114139186275]
Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation.<n>We propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state.<n>We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips.
arXiv Detail & Related papers (2025-04-01T17:57:18Z) - Cultivating Game Sense for Yourself: Making VLMs Gaming Experts [23.370716496046217]
We propose a paradigm shift in gameplay agent design.<n>Instead of directly controlling gameplay, VLM develops specialized execution modules tailored for tasks like shooting and combat.<n>These modules handle real-time game interactions, elevating VLM to a high-level developer.
arXiv Detail & Related papers (2025-03-27T08:40:47Z) - Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case [20.14197375326218]
This research aims to provide new insights and directions for applying multimodal agents in complex action game environments.
We select an ARPG, Black Myth: Wukong'', as a research platform to explore the capability boundaries of existing vision language models.
We will release a human operation dataset containing recorded gameplay videos and operation logs, including mouse and keyboard actions.
arXiv Detail & Related papers (2024-09-19T16:30:25Z) - GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs)
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z) - Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion
Models [68.85478477006178]
We present a Promptable Game Model (PGM) for neural video game simulators.
It allows a user to play the game by prompting it with high- and low-level action sequences.
Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt.
Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z) - WinoGAViL: Gamified Association Benchmark to Challenge
Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z) - Mastering Atari with Discrete World Models [61.7688353335468]
We introduce DreamerV2, a reinforcement learning agent that learns behaviors purely from predictions in the compact latent space of a powerful world model.
DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model.
arXiv Detail & Related papers (2020-10-05T17:52:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.