Related papers: LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents

LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents

URL: http://arxiv.org/abs/2410.02829v1
Date: Tue, 1 Oct 2024 18:40:43 GMT
Title: LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents
Authors: Chang Xiao, Brenda Z. Yang,
Abstract summary: We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process.
Score: 10.632179121247466
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Large Language Models (LLMs) have demonstrated their potential as autonomous agents across various tasks. One emerging application is the use of LLMs in playing games. In this work, we explore a practical problem for the gaming industry: Can LLMs be used to measure game difficulty? We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process. Based on our experiments, we also outline general principles and guidelines for incorporating LLMs into the game testing process.

Related papers

Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go [74.28228642327726]
Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding.<n>LoGos is a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language.<n>LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs.
arXiv Detail & Related papers (2026-01-23T05:00:49Z)
Strategies of cooperation and defection in five large language models [1.0249415982296137]
Large language models (LLMs) are increasingly deployed to support human decision-making.<n>This paper explores whether five leading models produce sensible strategies in the repeated prisoner's dilemma.<n>Our experiments shed light on how current LLMs instantiate reciprocal cooperation.
arXiv Detail & Related papers (2026-01-14T20:13:23Z)
Can Large Language Models Master Complex Card Games? [18.39826127562161]
Large language models (LLMs) have exhibited remarkable capabilities across various tasks.<n>We show that LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data.<n>LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data.
arXiv Detail & Related papers (2025-09-01T10:11:56Z)
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games [16.187737674778234]
We present textbfbenchname, a benchmark designed to train and evaluate Large Language Model (LLM) agents across diverse real-world video games.<n>To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on Model Context Protocol (MCP)<n>Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects.
arXiv Detail & Related papers (2025-06-04T06:40:33Z)
lmgame-Bench: How Good are LLMs at Playing Games? [60.01834131847881]
We study the major challenges in using popular video games to evaluate modern large language model (LLM) agents.<n>We introduce lmgame-Bench to turn games into reliable evaluations.
arXiv Detail & Related papers (2025-05-21T06:02:55Z)
Should You Use Your Large Language Model to Explore or Exploit? [55.562545113247666]
We evaluate the ability of large language models to help a decision-making agent facing an exploration-exploitation tradeoff. We find that while the current LLMs often struggle to exploit, in-context mitigations may be used to substantially improve performance for small-scale tasks.
arXiv Detail & Related papers (2025-01-31T23:42:53Z)
Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z)
Are You Human? An Adversarial Benchmark to Expose LLMs [2.6528263069045126]
Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation. We evaluate text-based prompts designed as challenges to expose LLM imposters in real-time.
arXiv Detail & Related papers (2024-10-12T15:33:50Z)
CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. This raises the question: To what extent can LLMs learn orthographic information? We propose a new benchmark, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z)
AI Meets the Classroom: When Does ChatGPT Harm Learning? [0.0]
We study how generative AI and specifically large language models (LLMs) impact learning in coding classes. We show across three studies that LLM usage can have positive and negative effects on learning outcomes.
arXiv Detail & Related papers (2024-08-29T17:07:46Z)
Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information [36.11862095329315]
Large language models (LLMs) have shown success in handling simple games with imperfect information. This study investigates the applicability of knowledge acquired by open-source and API-based LLMs to sophisticated text-based games.
arXiv Detail & Related papers (2024-08-05T15:36:46Z)
GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications. This paper evaluates LLMs' reasoning abilities in competitive environments. We first propose GTBench, a language-driven environment composing 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model's parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z)
LLM Augmented Hierarchical Agents [4.574041097539858]
Solving long-horizon, temporally-extended tasks using Reinforcement Learning (RL) is challenging, compounded by the common practice of learning without prior knowledge (or tabula rasa learning) In this paper we exploit the planning capabilities of LLMs while using RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve long-horizon tasks. This approach is evaluated in simulation environments such as MiniGrid, SkillHack, and Crafter, and on a real robot arm in block manipulation tasks.
arXiv Detail & Related papers (2023-11-09T18:54:28Z)
Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named Rephrase and Respond' (RaR) which allows Large Language Models to rephrase and expand questions posed by humans. RaR serves as a simple yet effective prompting method for improving performance. We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z)
LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay [55.12945794835791]
Using Avalon as a testbed, we employ system prompts to guide LLM agents in gameplay. We propose a novel framework, tailored for Avalon, features a multi-agent system facilitating efficient communication and interaction. Results affirm the framework's effectiveness in creating adaptive agents and suggest LLM-based agents' potential in navigating dynamic social interactions.
arXiv Detail & Related papers (2023-10-23T14:35:26Z)
SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM) In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.