Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay
- URL: http://arxiv.org/abs/2407.11068v3
- Date: Sun, 18 Aug 2024 19:44:35 GMT
- Title: Show, Don't Tell: Evaluating Large Language Models Beyond Textual Understanding with ChildPlay
- Authors: Gonçalo Hora de Carvalho, Oscar Knap, Robert Pollice
- Abstract summary: We use games like Tic-Tac-Toe, Connect Four, and Battleship to assess strategic thinking and decision-making.
Despite their proficiency on standard benchmarks, GPT-3.5 and GPT-4's abilities to play and reason about fully observable games without pre-training are mediocre.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the hypothesis that LLMs, such as GPT-3.5 and GPT-4, possess broader cognitive functions, particularly in non-linguistic domains. Our approach extends beyond standard linguistic benchmarks by incorporating games like Tic-Tac-Toe, Connect Four, and Battleship, encoded via ASCII, to assess strategic thinking and decision-making. To evaluate the models' ability to generalize beyond their training data, we introduce two additional games. The first game, LEGO Connect Language (LCL), tests the models' capacity to understand spatial logic and follow assembly instructions. The second game, the game of shapes, challenges the models to identify shapes represented by 1s within a matrix of zeros, further testing their spatial reasoning skills. This "show, don't tell" strategy uses games instead of simply querying the models. Our results show that despite their proficiency on standard benchmarks, GPT-3.5 and GPT-4's abilities to play and reason about fully observable games without pre-training are mediocre. Both models fail to anticipate losing moves in Tic-Tac-Toe and Connect Four, and they are unable to play Battleship correctly. While GPT-4 shows some success in the game of shapes, both models fail at the assembly tasks presented in the LCL game. These results suggest that while GPT models can emulate conversational proficiency and basic rule comprehension, their performance in strategic gameplay and spatial reasoning tasks is very limited. Importantly, this reveals a blind spot in current LLM benchmarks that we highlight with our gameplay benchmark suite ChildPlay (https://github.com/child-play-neurips/child-play). Our findings provide a cautionary tale about claims of emergent intelligence and reasoning capabilities of LLMs that are roughly the size of GPT-3.5 and GPT-4.
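To make the "show, don't tell" setup concrete, the minimal Python sketch below illustrates how a fully observable board and a game-of-shapes matrix can be serialized as plain ASCII text for a prompt. It is an illustrative assumption only; the exact symbols, prompt wording, board sizes, and shapes used in ChildPlay are defined in the linked repository and may differ.

    # Illustrative sketch: ASCII-encoding a Tic-Tac-Toe board and a "game of
    # shapes" matrix. Hypothetical encoding, not ChildPlay's actual prompts;
    # see https://github.com/child-play-neurips/child-play for the real ones.

    def board_to_ascii(board):
        """Render a 3x3 Tic-Tac-Toe board (lists of 'X', 'O', ' ') as text."""
        return "\n---------\n".join(" " + " | ".join(row) for row in board)

    def shape_matrix(kind, size=5):
        """Return a size x size matrix of 0s with a shape drawn in 1s (hypothetical shapes)."""
        m = [[0] * size for _ in range(size)]
        if kind == "square":
            for r in range(1, size - 1):
                for c in range(1, size - 1):
                    if r in (1, size - 2) or c in (1, size - 2):
                        m[r][c] = 1
        elif kind == "cross":
            mid = size // 2
            for i in range(size):
                m[mid][i] = 1
                m[i][mid] = 1
        return m

    if __name__ == "__main__":
        board = [["X", "O", " "], [" ", "X", " "], ["O", " ", " "]]
        print(board_to_ascii(board))
        print()
        for row in shape_matrix("cross"):
            print(" ".join(str(v) for v in row))

Feeding such text to a model and parsing its reply back into a move or a shape label is, in essence, what it takes to "show" a game rather than describe it.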
Related papers
- GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps [5.874552372073687]
Large language models (LLMs) have recently demonstrated great success in generating and understanding natural language.
We propose GameTraversalBenchmark (GTB), a benchmark consisting of diverse 2D grid-based game maps.
GPT-4-Turbo achieved the highest score of 44.97% on GTB_Score (GTBS), a composite score combining the benchmark's three evaluation criteria.
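As a rough illustration of what traversing a 2D grid map involves, the hypothetical sketch below checks a proposed move sequence against a small map. It is not the GameTraversalBenchmark code, and the actual GTB_Score aggregates criteria that are not reproduced here.

    # Hypothetical sketch: validating a move sequence on a 2D grid map.
    # Not the actual GameTraversalBenchmark evaluation code.
    GRID = [
        "S..#",
        ".#..",
        "...G",
    ]
    MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

    def follows_path(grid, path):
        """Return True if the U/D/L/R path leads from S to G without hitting walls (#)."""
        rows, cols = len(grid), len(grid[0])
        r, c = next((i, row.index("S")) for i, row in enumerate(grid) if "S" in row)
        for step in path:
            dr, dc = MOVES[step]
            r, c = r + dr, c + dc
            if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
                return False
        return grid[r][c] == "G"

    print(follows_path(GRID, "DDRRR"))  # True: down twice, then right to the goal
    print(follows_path(GRID, "RRRR"))   # False: walks into the wall at (0, 3)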
arXiv Detail & Related papers (2024-10-10T09:54:28Z)
- Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard [0.0]
We introduce a novel benchmark for large language models (LLMs) through grid-based games such as Tic-Tac-Toe, Connect Four, and Gomoku.
The open-source game simulation code, available on GitHub, lets LLMs compete and generates detailed data files.
We present the results of games among leading LLMs, including Claude 3.5 Sonnet and Claude 3 Sonnet by Anthropic, Gemini 1.5 Pro and Gemini Flash by Google, GPT-4 Turbo and GPT-4o by OpenAI, and Llama3-70B by Meta.
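A minimal sketch of such a competition loop is shown below; the placeholder players stand in for LLM API calls, and the real benchmark's prompting, logging, and game set live in its GitHub repository.

    # Minimal sketch of a grid-game competition loop. The "players" are
    # placeholder functions standing in for LLM calls, not the benchmark's code.
    import random

    def random_player(board, symbol):
        """Placeholder for an LLM move: pick any empty cell."""
        empty = [(r, c) for r in range(3) for c in range(3) if board[r][c] == " "]
        return random.choice(empty)

    def winner(board):
        lines = [[(r, c) for c in range(3)] for r in range(3)]                  # rows
        lines += [[(r, c) for r in range(3)] for c in range(3)]                 # columns
        lines += [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]   # diagonals
        for line in lines:
            values = {board[r][c] for r, c in line}
            if len(values) == 1 and " " not in values:
                return values.pop()
        return None

    def play_match(player_x, player_o):
        board = [[" "] * 3 for _ in range(3)]
        for turn in range(9):
            symbol = "X" if turn % 2 == 0 else "O"
            player = player_x if symbol == "X" else player_o
            r, c = player(board, symbol)
            board[r][c] = symbol
            if winner(board):
                return symbol
        return "draw"

    print(play_match(random_player, random_player))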
arXiv Detail & Related papers (2024-07-10T16:14:34Z)
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and the Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are each affected by at least one systematic bias.
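For context, the sketch below encodes payoff matrices for Stag Hunt and the Prisoner's Dilemma and looks up the payoff of a pair of actions; the numeric payoffs are conventional textbook values, not necessarily the ones used in the study.

    # Conventional payoff matrices (illustrative values; the paper's exact payoffs
    # may differ). Entries map (row action, column action) to
    # (row payoff, column payoff) - non-zero-sum, so payoffs need not cancel out.
    STAG_HUNT = {
        ("stag", "stag"): (4, 4), ("stag", "hare"): (0, 3),
        ("hare", "stag"): (3, 0), ("hare", "hare"): (3, 3),
    }
    PRISONERS_DILEMMA = {
        ("cooperate", "cooperate"): (3, 3), ("cooperate", "defect"): (0, 5),
        ("defect", "cooperate"): (5, 0), ("defect", "defect"): (1, 1),
    }

    def payoff(game, row_action, col_action):
        return game[(row_action, col_action)]

    # Mutual cooperation beats mutual defection, yet defection is individually tempting.
    print(payoff(PRISONERS_DILEMMA, "cooperate", "cooperate"))  # (3, 3)
    print(payoff(PRISONERS_DILEMMA, "defect", "cooperate"))     # (5, 0)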
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents [4.209869303518743]
We introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of large language models.
Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP).
Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action.
arXiv Detail & Related papers (2024-06-07T00:28:43Z)
- Will GPT-4 Run DOOM? [0.0]
We show that GPT-4's reasoning and planning capabilities extend to the 1993 first-person shooter Doom.
We find that GPT-4 can play the game to a passable degree: it is able to manipulate doors, combat enemies, and perform pathing.
arXiv Detail & Related papers (2024-03-08T17:30:41Z)
- Loose LIPS Sink Ships: Asking Questions in Battleship with Language-Informed Program Sampling [80.64715784334936]
We study tradeoffs in a classic grounded question-asking task based on the board game Battleship.
Our model uses large language models (LLMs) to generate natural language questions, translate them into symbolic programs, and evaluate their expected information gain.
We find that with a surprisingly modest resource budget, this simple Monte Carlo optimization strategy yields informative questions that mirror human performance.
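The sketch below illustrates the underlying idea of scoring a yes/no question by its expected information gain over sampled Battleship board hypotheses. It is a simplified stand-in for the paper's pipeline, which first translates natural-language questions into symbolic programs; the board prior and questions here are hypothetical.

    # Simplified illustration of expected information gain (EIG) for a yes/no
    # question over sampled board hypotheses. Not the paper's model or code.
    import math, random

    def sample_board(size=4):
        """Hypothetical prior: one 1x2 ship placed uniformly at random, horizontally."""
        board = [[0] * size for _ in range(size)]
        r, c = random.randrange(size), random.randrange(size - 1)
        board[r][c] = board[r][c + 1] = 1
        return board

    def expected_information_gain(question, samples):
        """EIG of a boolean question = entropy of its answer under the sampled hypotheses."""
        p_yes = sum(question(b) for b in samples) / len(samples)
        if p_yes in (0.0, 1.0):
            return 0.0  # answer already certain, so the question is uninformative
        return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))

    samples = [sample_board() for _ in range(2000)]
    q_corner = lambda b: b[0][0] == 1                             # "Is a ship part in the top-left corner?"
    q_top_half = lambda b: any(v for row in b[:2] for v in row)   # "Is any ship part in the top half?"
    print(expected_information_gain(q_corner, samples))    # low: the answer is almost always "no"
    print(expected_information_gain(q_top_half, samples))  # close to 1 bit: a much more informative question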
arXiv Detail & Related papers (2024-02-29T18:58:15Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
- How FaR Are Large Language Models From Agents with Theory-of-Mind? [69.41586417697732]
We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D).
T4D requires models to connect inferences about others' mental states to actions in social scenarios.
We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
arXiv Detail & Related papers (2023-10-04T06:47:58Z)
- Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing [0.0]
This paper investigates the strategic decision-making capabilities of three Large Language Models (LLMs): GPT-3.5, GPT-4, and LLaMa-2.
Utilizing four canonical two-player games, we explore how these models navigate social dilemmas.
arXiv Detail & Related papers (2023-09-12T00:54:15Z)
- SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, which reads the game's original academic paper and uses the knowledge learned to reason about and play the game through a large language model (LLM).
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
- Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models [68.85478477006178]
We present a Promptable Game Model (PGM) for neural video game simulators.
It allows a user to play the game by prompting it with high- and low-level action sequences.
Most captivatingly, our PGM unlocks the director's mode, where the game is played by specifying goals for the agents in the form of a prompt.
Our method significantly outperforms existing neural video game simulators in terms of rendering quality and unlocks applications beyond the capabilities of the current state of the art.
arXiv Detail & Related papers (2023-03-23T17:43:17Z)