TextQuests: How Good are LLMs at Text-Based Video Games?
- URL: http://arxiv.org/abs/2507.23701v1
- Date: Thu, 31 Jul 2025 16:22:55 GMT
- Title: TextQuests: How Good are LLMs at Text-Based Video Games?
- Authors: Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks
- Abstract summary: TextQuests is a benchmark based on the Infocom suite of interactive fiction games. It is designed to assess an agent's capacity for self-contained problem-solving by precluding the use of external tools.
- Score: 36.024745739590216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent's ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent's capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.
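The abstract describes an evaluation setup in which an agent plays an interactive fiction game within a single session, choosing each text command from the full, ever-growing transcript, with no external tools. A minimal sketch of that loop is below; the `ToyGame` environment and `scripted_policy` are illustrative stand-ins (not the actual TextQuests API), and in the real benchmark the policy would be an LLM call conditioned on the entire history.

```python
# Sketch of a single-session text-game agent loop, in the spirit of
# TextQuests-style evaluation. ToyGame and scripted_policy are
# hypothetical stand-ins, not the benchmark's real interface.

class ToyGame:
    """A two-room stand-in for an Infocom-style game."""
    def __init__(self):
        self.location = "cellar"
        self.done = False

    def step(self, command):
        # Return (observation, reward) for a text command.
        if command == "go north" and self.location == "cellar":
            self.location = "hall"
            return "You climb the stairs into the hall.", 0
        if command == "open door" and self.location == "hall":
            self.done = True
            return "The door swings open. You escape!", 1
        return "Nothing happens.", 0

def scripted_policy(transcript):
    # Stand-in for the LLM: in the benchmark this would be something
    # like model.generate(transcript), with the whole history in context.
    plan = ["go north", "open door"]
    turn = sum(1 for line in transcript if line.startswith("> "))
    return plan[min(turn, len(plan) - 1)]

def play(env, policy, max_steps=10):
    transcript = ["You are in a dark cellar."]
    score = 0
    for _ in range(max_steps):
        action = policy(transcript)
        obs, reward = env.step(action)
        transcript += [f"> {action}", obs]  # context grows every turn
        score += reward
        if env.done:
            break
    return transcript, score

transcript, score = play(ToyGame(), scripted_policy)
```

The key property this sketch captures is that the agent's input is the cumulative transcript rather than only the latest observation, which is what makes long-context, trial-and-error reasoning the bottleneck.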
Related papers
- From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information? [34.959850282872594]
We present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families: detective cases, situation puzzles, and guessing numbers. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning.
arXiv Detail & Related papers (2025-06-09T23:56:41Z) - A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions [51.96890647837277]
Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence.
arXiv Detail & Related papers (2025-04-07T21:01:25Z) - AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios [38.878966229688054]
We introduce AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios.
Drawing on Dramaturgical Theory, AgentSense employs a bottom-up approach to create 1,225 diverse social scenarios constructed from extensive scripts.
We analyze goals using ERG theory and conduct comprehensive experiments.
Our findings highlight that LLMs struggle with goals in complex social scenarios, especially high-level growth needs, and even GPT-4o requires improvement in private information reasoning.
arXiv Detail & Related papers (2024-10-25T07:04:16Z) - A Survey on Complex Tasks for Goal-Directed Interactive Agents [60.53915548970061]
This survey compiles relevant tasks and environments for evaluating goal-directed interactive agents.
An up-to-date compilation of relevant resources can be found on our project website.
arXiv Detail & Related papers (2024-09-27T08:17:53Z) - MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z) - STARLING: Self-supervised Training of Text-based Reinforcement Learning Agent with Large Language Models [5.786039929801102]
Existing environments for interactive fiction games are domain-specific or time-consuming to generate, and do not train RL agents to master a specific set of skills.
We introduce STARLING, an interactive environment for self-supervised RL in text-based games that bootstraps text-based RL agents with automatically generated games, improving their performance and generalization toward the goal of the target environment.
arXiv Detail & Related papers (2024-06-09T18:07:47Z) - VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z) - Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games [26.07074182316433]
We introduce the first dataset specifically for Jubensha, including character scripts and game rules.
Our work also presents a unique multi-agent interaction framework using LLMs, allowing AI agents to autonomously engage in this game.
To evaluate the gaming performance of these AI agents, we developed novel methods measuring their mastery of case information and reasoning skills.
arXiv Detail & Related papers (2023-12-01T17:33:57Z) - WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z) - An Analysis of Deep Reinforcement Learning Agents for Text-based Games [4.9702715037812055]
Text-based games (TBG) are complex environments in which users or computer agents interact through text to achieve game goals.
Measuring the performance of TBG agents' deep learning modules in standardized environments, and comparing them across different evaluation types, is also important for TBG agent research.
We constructed a standardized TBG agent with no hand-crafted rules, formally categorized TBG evaluation types, and analyzed selected methods in our environment.
arXiv Detail & Related papers (2022-09-09T03:36:06Z) - Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents [21.29303927728839]
We identify key properties of text worlds that make them suitable for exploration by autonomous agents.
We discuss the opportunities of using autonomous agents to make progress on text environment benchmarks.
arXiv Detail & Related papers (2022-07-08T20:31:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.