GameDevBench: Evaluating Agentic Capabilities Through Game Development
- URL: http://arxiv.org/abs/2602.11103v1
- Date: Wed, 11 Feb 2026 18:15:11 GMT
- Title: GameDevBench: Evaluating Agentic Capabilities Through Game Development
- Authors: Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue
- Abstract summary: Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires more than three times as many lines of code and file changes as prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
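The image-based feedback mechanism described in the abstract can be pictured as an edit-render-observe loop: after each code change, the game scene is rendered and the resulting image is fed back to the agent. The sketch below is purely illustrative; `Agent`, `render_scene`, and `feedback_loop` are hypothetical names, not the GameDevBench API, and a real system would call a multimodal model and capture actual game screenshots.

```python
# Hypothetical sketch of an image-feedback loop for a game-development agent.
# None of these names come from GameDevBench; they illustrate the general idea.

from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy stand-in for a multimodal coding agent."""
    history: list = field(default_factory=list)

    def act(self, observation: str) -> str:
        # A real agent would query a multimodal LLM with the observation;
        # here we just record it and emit a placeholder edit label.
        self.history.append(observation)
        return f"edit-{len(self.history)}"


def render_scene(edit: str) -> str:
    # Stand-in for launching the game and capturing a screenshot of the scene.
    return f"screenshot-after-{edit}"


def feedback_loop(agent: Agent, task: str, max_steps: int = 3) -> list:
    """Run edit -> render -> observe until the step budget is spent."""
    observation = task
    trace = []
    for _ in range(max_steps):
        edit = agent.act(observation)
        observation = render_scene(edit)  # visual feedback for the next step
        trace.append((edit, observation))
    return trace


trace = feedback_loop(Agent(), "make the sprite flip when jumping")
```

The key design point, as the abstract suggests, is that the agent sees the visual consequence of each change rather than reasoning about the scene from code alone.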
Related papers
- NitroGen: An Open Foundation Model for Generalist Gaming Agents [101.41866522979548]
NitroGen is a vision-action foundation model for generalist gaming agents. It is trained on 40,000 hours of gameplay videos across more than 1,000 games.
arXiv Detail & Related papers (2026-01-04T16:24:50Z)
- InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search [48.79494320593913]
We introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. We propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher).
arXiv Detail & Related papers (2025-12-21T14:23:07Z)
- Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents [56.25101378553328]
We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned keyboard-mouse inputs. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Experiments show that Game-TARS achieves about twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks.
arXiv Detail & Related papers (2025-10-27T17:43:51Z)
- FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games [56.81554611870848]
We introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap.
arXiv Detail & Related papers (2025-09-01T01:33:16Z)
- Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts [54.21319853862452]
We present Optimus-3, a general-purpose agent for Minecraft. We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability for visual diversity.
arXiv Detail & Related papers (2025-06-12T05:29:40Z)
- Cultivating Game Sense for Yourself: Making VLMs Gaming Experts [23.370716496046217]
We propose a paradigm shift in gameplay agent design. Instead of directly controlling gameplay, the VLM develops specialized execution modules tailored for tasks like shooting and combat. These modules handle real-time game interactions, elevating the VLM to a high-level developer.
arXiv Detail & Related papers (2025-03-27T08:40:47Z)
- TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft [40.419794780178044]
We present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents.
arXiv Detail & Related papers (2024-12-06T18:41:16Z)
- A Survey on Large Language Model-Based Game Agents [35.34074811680046]
Game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Recently, the emergence of Large Language Models (LLMs) provides new opportunities to endow these agents with generalizable reasoning. This survey offers an up-to-date review of LLM-based game agents through a unified reference architecture.
arXiv Detail & Related papers (2024-04-02T15:34:18Z)
- GameGPT: Multi-agent Collaborative Framework for Game Development [10.8750049774263]
Large language model (LLM) based agents have demonstrated their capacity to automate and expedite software development processes. We propose a multi-agent collaborative framework, dubbed GameGPT, to automate game development.
arXiv Detail & Related papers (2023-10-12T06:31:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.