Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- URL: http://arxiv.org/abs/2406.11012v5
- Date: Mon, 15 Jul 2024 16:17:52 GMT
- Title: Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- Authors: Prisha Samadarshi, Mariam Mustafa, Anushka Kulkarni, Raven Rothkopf, Tuhin Chakrabarty, Smaranda Muresan,
- Abstract summary: We evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players.
Our results show that even the best-performing LLM, GPT-4o, can only fully solve 8% of the games.
- Score: 20.64536059771047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 200 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 8% of the games. Compared to GPT-4o, novice and expert players perform better, with expert human players significantly outperforming GPT-4o. To deepen our understanding we create a taxonomy of the knowledge types required to successfully categorize words in the Connections game, revealing that LLMs struggle with associative, encyclopedic, and linguistic knowledge. Our findings establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in humans and AI systems.
Related papers
- Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game [6.136654326170453]
The Connections puzzle is a word association game published daily by The New York Times (NYT)
generating novel puzzles requires a form of metacognition: generators must be able to accurately model the downstream reasoning of potential solvers.
Our findings show that LLMs are capable puzzle creators, and can generate diverse sets of enjoyable, challenging, and creative Connections puzzles as judged by human users.
arXiv Detail & Related papers (2024-07-15T21:05:25Z) - When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives [46.04238534224658]
We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives.
We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives.
Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns.
arXiv Detail & Related papers (2024-06-17T20:49:35Z) - Missed Connections: Lateral Thinking Puzzles for Large Language Models [2.1374208474242815]
The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme.
We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning.
arXiv Detail & Related papers (2024-04-17T20:31:05Z) - Leveraging Word Guessing Games to Assess the Intelligence of Large
Language Models [105.39236338147715]
The paper is inspired by the popular language game Who is Spy''
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z) - GameEval: Evaluating LLMs on Conversational Games [93.40433639746331]
We propose GameEval, a novel approach to evaluating large language models (LLMs)
GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms.
We show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
arXiv Detail & Related papers (2023-08-19T14:33:40Z) - SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM)
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z) - JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions [75.42526766746515]
We propose a new commonsense reasoning dataset based on human's Interactive Fiction (IF) gameplay walkthroughs.
Our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge.
Experiments show that the introduced dataset is challenging to previous machine reading models as well as the new large language models.
arXiv Detail & Related papers (2022-10-18T19:20:53Z) - WinoGAViL: Gamified Association Benchmark to Challenge
Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.