GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
- URL: http://arxiv.org/abs/2402.12348v2
- Date: Mon, 10 Jun 2024 17:14:09 GMT
- Title: GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
- Authors: Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu,
- Abstract summary: Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment composing 10 widely recognized tasks.
- Score: 87.99872683336395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment composing 10 widely recognized tasks, across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we (1) Characterize the game-theoretic reasoning of LLMs; and (2) Perform LLM-vs.-LLM competitions as reasoning evaluation. We observe that (1) LLMs have distinct behaviors regarding various gaming scenarios; for example, LLMs fail in complete and deterministic games yet they are competitive in probabilistic gaming scenarios; (2) Most open-source LLMs, e.g., CodeLlama-34b-Instruct and Llama-2-70b-chat, are less competitive than commercial LLMs, e.g., GPT-4, in complex games, yet the recently released Llama-3-70b-Instruct makes up for this shortcoming. In addition, code-pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. We further characterize the game-theoretic properties of LLMs, such as equilibrium and Pareto Efficiency in repeated games. Detailed error profiles are provided for a better understanding of LLMs' behavior. We hope our research provides standardized protocols and serves as a foundation to spur further explorations in the strategic reasoning of LLMs.
Related papers
- LLMs May Not Be Human-Level Players, But They Can Be Testers: Measuring Game Difficulty with LLM Agents [10.632179121247466]
We propose a general game-testing framework using LLM agents and test it on two widely played strategy games: Wordle and Slay the Spire.
Our results reveal an interesting finding: although LLMs may not perform as well as the average human player, their performance, when guided by simple, generic prompting techniques, shows a statistically significant and strong correlation with difficulty indicated by human players.
This suggests that LLMs could serve as effective agents for measuring game difficulty during the development process.
arXiv Detail & Related papers (2024-10-01T18:40:43Z) - LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models [87.49676980090555]
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities.
We introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs.
arXiv Detail & Related papers (2024-08-28T13:16:41Z) - Large Language Models Playing Mixed Strategy Nash Equilibrium Games [1.060608983034705]
This paper focuses on the capabilities of Large Language Models to find the Nash equilibrium in games with a mixed strategy Nash equilibrium and no pure strategy Nash equilibrium.
The study reveals a significant enhancement in the performance of LLMs when they are equipped with the possibility to run code.
It is evident that while LLMs exhibit remarkable proficiency in well-known standard games, their performance dwindles when faced with slight modifications of the same games.
arXiv Detail & Related papers (2024-06-15T09:30:20Z) - LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But, can they really "reason" over the natural language?
This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z) - Do Large Language Models Understand Logic or Just Mimick Context? [14.081178100662163]
This paper investigates the reasoning capabilities of large language models (LLMs) on two logical reasoning datasets.
It is found that LLMs do not truly understand logical rules; rather, in-context learning has simply enhanced the likelihood of these models arriving at the correct answers.
arXiv Detail & Related papers (2024-02-19T12:12:35Z) - ALYMPICS: LLM Agents Meet Game Theory -- Exploring Strategic
Decision-Making with AI Agents [77.34720446306419]
Alympics is a systematic simulation framework utilizing Large Language Model (LLM) agents for game theory research.
Alympics creates a versatile platform for studying complex game theory problems.
arXiv Detail & Related papers (2023-11-06T16:03:46Z) - Leveraging Word Guessing Games to Assess the Intelligence of Large
Language Models [105.39236338147715]
The paper is inspired by the popular language game Who is Spy''
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z) - SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM)
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.