LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents
- URL: http://arxiv.org/abs/2410.02829v1
- Date: Tue, 1 Oct 2024 18:40:43 GMT
- Title: LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents
- Authors: Chang Xiao, Brenda Z. Yang,
- Abstract summary: We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire.
Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players.
This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process.
- Score: 10.632179121247466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in Large Language Models (LLMs) have demonstrated their potential as autonomous agents across various tasks. One emerging application is the use of LLMs in playing games. In this work, we explore a practical problem for the gaming industry: Can LLMs be used to measure game difficulty? We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire. Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players. This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process. Based on our experiments, we also outline general principles and guidelines for incorporating LLMs into the game testing process.
Related papers
- Should You Use Your Large Language Model to Explore or Exploit? [55.562545113247666]
We evaluate the ability of large language models to help a decision-making agent facing an exploration-exploitation tradeoff.
We find that while the current LLMs often struggle to exploit, in-context mitigations may be used to substantially improve performance for small-scale tasks.
arXiv Detail & Related papers (2025-01-31T23:42:53Z) - Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models.
We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics.
Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z) - Are You Human? An Adversarial Benchmark to Expose LLMs [2.6528263069045126]
Large Language Models (LLMs) have demonstrated an alarming ability to impersonate humans in conversation.
We evaluate text-based prompts designed as challenges to expose LLM imposters in real-time.
arXiv Detail & Related papers (2024-10-12T15:33:50Z) - CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks.
This raises the question: To what extent can LLMs learn orthographic information?
We propose a new benchmark, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z) - Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information [36.11862095329315]
Large language models (LLMs) have shown success in handling simple games with imperfect information.
This study investigates the applicability of knowledge acquired by open-source and API-based LLMs to sophisticated text-based games.
arXiv Detail & Related papers (2024-08-05T15:36:46Z) - GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment composing 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z) - LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay [55.12945794835791]
Using Avalon as a testbed, we employ system prompts to guide LLM agents in gameplay.
We propose a novel framework, tailored for Avalon, features a multi-agent system facilitating efficient communication and interaction.
Results affirm the framework's effectiveness in creating adaptive agents and suggest LLM-based agents' potential in navigating dynamic social interactions.
arXiv Detail & Related papers (2023-10-23T14:35:26Z) - SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM)
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.