FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
- URL: http://arxiv.org/abs/2509.01052v2
- Date: Wed, 15 Oct 2025 10:33:27 GMT
- Title: FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
- Authors: Jaewoo Ahn, Junseo Kim, Heeseung Yun, Jaehyeon Son, Dongmin Park, Jaewoong Cho, Gunhee Kim
- Abstract summary: We introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap.
- Score: 56.81554611870848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.
Related papers
- GameDevBench: Evaluating Agentic Capabilities Through Game Development [49.19956546746812]
Game development provides such a testbed, as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks.
arXiv Detail & Related papers (2026-02-11T18:15:11Z)
- EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents [52.567469286881426]
We introduce EMemBench, a programmatic benchmark for evaluating the long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent's own trajectory. Each template computes verifiable ground truth from underlying game signals.
arXiv Detail & Related papers (2026-01-23T12:09:59Z)
- Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents [56.25101378553328]
We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned keyboard-mouse inputs. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Experiments show that Game-TARS achieves roughly twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks.
arXiv Detail & Related papers (2025-10-27T17:43:51Z)
- MIMIC: Integrating Diverse Personality Traits for Better Game Testing Using Large Language Model [9.426130742272715]
MIMIC is a novel framework that integrates diverse personality traits into gaming agents. It can achieve higher test coverage and richer in-game interactions across different games. It also outperforms state-of-the-art agents in Minecraft by achieving a higher task completion rate.
arXiv Detail & Related papers (2025-10-02T03:30:00Z)
- You Have Thirteen Hours in Which to Solve the Labyrinth: Enhancing AI Game Masters with Function Calling [35.721053667746716]
This paper presents a novel approach to enhancing AI game masters by leveraging function calling in the context of the tabletop role-playing game "Jim Henson's Labyrinth: The Adventure Game".
Our methodology involves integrating game-specific controls through functions, which we show improves the narrative quality and state-update consistency of the AI game master.
arXiv Detail & Related papers (2024-09-11T02:03:51Z)
- A Survey on Large Language Model-Based Game Agents [35.34074811680046]
Game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Recently, the emergence of Large Language Models (LLMs) provides new opportunities to endow these agents with generalizable reasoning. This survey offers an up-to-date review of LLM-based game agents through a unified reference architecture.
arXiv Detail & Related papers (2024-04-02T15:34:18Z)
- Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation [52.930183136111864]
We propose using scorable negotiation to evaluate Large Language Models (LLMs).
To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities.
We provide procedures to create new games and to increase games' difficulty, yielding an evolving benchmark.
arXiv Detail & Related papers (2023-09-29T13:33:06Z)
- Preference-conditioned Pixel-based AI Agent For Game Testing [1.5059676044537105]
Game-testing AI agents that learn by interaction with the environment have the potential to mitigate these challenges.
This paper proposes an agent design that mainly depends on pixel-based state observations while exploring the environment conditioned on a user's preference.
Our agent significantly outperforms state-of-the-art pixel-based game-testing agents in exploration coverage and test-execution quality when evaluated on a complex open-world environment resembling many aspects of real AAA games.
arXiv Detail & Related papers (2023-08-18T04:19:36Z)
- Tachikuma: Understanding Complex Interactions with Multi-Character and Novel Objects by Large Language Models [67.20964015591262]
We introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation task and a supporting dataset.
The dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further explorations.
We present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding.
arXiv Detail & Related papers (2023-07-24T07:40:59Z)
- SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, which reads the game's original academic paper and uses the learned knowledge to reason about and play the game through a large language model (LLM).
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
- Go-Explore Complex 3D Game Environments for Automated Reachability Testing [4.322647881761983]
We propose an approach specifically targeted at reachability bugs in simulated 3D environments, based on the powerful exploration algorithm Go-Explore.
Go-Explore saves unique checkpoints across the map and then identifies promising ones to explore from.
Our algorithm can fully cover a vast 1.5km x 1.5km game world within 10 hours on a single machine.
arXiv Detail & Related papers (2022-09-01T16:31:37Z)
- Off-Beat Multi-Agent Reinforcement Learning [62.833358249873704]
We investigate model-free multi-agent reinforcement learning (MARL) in environments where off-beat actions are prevalent.
We propose a novel episodic memory, LeGEM, for model-free MARL algorithms.
We evaluate LeGEM on various multi-agent scenarios with off-beat actions, including Stag-Hunter Game, Quarry Game, Afforestation Game, and StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2022-05-27T02:21:04Z)
- An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games [79.23847247132345]
This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA).
We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic successful guessing games, and 2) a novel way for an agent to play by itself, called Self-play via Iterated Experience Learning (SPIEL).
arXiv Detail & Related papers (2021-01-31T10:30:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.