Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- URL: http://arxiv.org/abs/2406.11012v7
- Date: Mon, 14 Oct 2024 03:41:26 GMT
- Title: Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
- Authors: Prisha Samadarshi, Mariam Mustafa, Anushka Kulkarni, Raven Rothkopf, Tuhin Chakrabarty, Smaranda Muresan
- Abstract summary: We evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players.
Our results show that even the best-performing LLM, Claude 3.5 Sonnet, can only fully solve 18% of the games.
We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game.
- Abstract: The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 438 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, Claude 3.5 Sonnet, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 18% of the games. Novice and expert players perform better than Claude 3.5 Sonnet, with expert human players significantly outperforming it. We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game. We find that while LLMs perform relatively well at categorizing words based on semantic relations, they struggle with other types of knowledge such as Encyclopedic Knowledge, Multiword Expressions, or knowledge that combines both Word Form and Meaning. Our results establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in AI systems.
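To make the evaluation setup concrete, below is a minimal scoring sketch. It is our illustration only: the prompt format, response parsing, example words, and scoring rule are assumptions, not the paper's released harness or data. The model is shown the 16 shuffled words and must return four groups of four; a game counts as fully solved only when all four groups exactly match the gold solution.

```python
import random

def parse_groups(response: str) -> list[set[str]]:
    """Expect one comma-separated group of four words per line."""
    groups = []
    for line in response.strip().splitlines():
        words = {w.strip().upper() for w in line.split(",") if w.strip()}
        if len(words) == 4:
            groups.append(words)
    return groups

def score(gold: list[set[str]], predicted: list[set[str]]) -> float:
    """Fraction of gold groups reproduced exactly; 1.0 = fully solved."""
    return sum(g in predicted for g in gold) / len(gold)

# Hypothetical puzzle: four themed groups of four words each.
gold = [
    {"BASS", "PIKE", "SOLE", "PERCH"},      # fish
    {"SECOND", "MINUTE", "HOUR", "DAY"},    # units of time
    {"JACK", "QUEEN", "KING", "ACE"},       # playing cards
    {"RED", "AMBER", "GREEN", "FLASHING"},  # traffic-light states
]
board = [w for g in gold for w in g]
random.shuffle(board)  # the model would be prompted with these 16 words

# Stand-in for a model reply; a real harness would query an LLM here.
reply = ("BASS, PIKE, SOLE, PERCH\n"
         "SECOND, MINUTE, HOUR, DAY\n"
         "JACK, QUEEN, KING, ACE\n"
         "RED, AMBER, GREEN, FLASHING")
print(score(gold, parse_groups(reply)))  # -> 1.0, i.e. fully solved
```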
Related papers
- Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests [89.09172401497213]
We examine three evaluation paradigms: large question-answering benchmarks, interactive games, and cognitive tests.
We compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use.
Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models.
arXiv Detail & Related papers (2025-02-20T08:36:58Z)
- Codenames as a Benchmark for Large Language Models [2.1028463367241033]
We use the popular word-based board game Codenames as a benchmark for evaluating the reasoning capabilities of Large Language Models.
We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1.
Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles.
arXiv Detail & Related papers (2024-12-16T01:59:03Z)
- NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers [5.397565689903148]
We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game.
This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills.
arXiv Detail & Related papers (2024-12-02T15:41:47Z)
- Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game [6.136654326170453]
The Connections puzzle is a word association game published daily by The New York Times (NYT). Generating novel puzzles requires a form of metacognition: generators must be able to accurately model the downstream reasoning of potential solvers.
Our findings show that LLMs are capable puzzle creators, and can generate diverse sets of enjoyable, challenging, and creative Connections puzzles as judged by human users.
arXiv Detail & Related papers (2024-07-15T21:05:25Z)
- Missed Connections: Lateral Thinking Puzzles for Large Language Models [2.1374208474242815]
The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme.
We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning.
arXiv Detail & Related papers (2024-04-17T20:31:05Z)
- Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models [105.39236338147715]
The paper is inspired by the popular language game "Who is Spy".
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z)
- SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose SPRING, a novel approach that reads a game's original academic paper and uses the knowledge learned to reason about and play the game through a large language model (LLM).
In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment.
Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
- JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions [75.42526766746515]
We propose a new commonsense reasoning dataset based on human players' Interactive Fiction (IF) gameplay walkthroughs.
Our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge.
Experiments show that the introduced dataset is challenging for previous machine reading models as well as for new large language models.
arXiv Detail & Related papers (2022-10-18T19:20:53Z)
- WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z)
- Contextual Games: Multi-Agent Learning with Side Information [57.76996806603094]
We formulate the novel class of contextual games, repeated games driven by contextual information revealed at each round.
By means of kernel-based regularity assumptions, we model the correlation between different contexts and game outcomes.
We propose a novel online (meta) algorithm that exploits such correlations to minimize the contextual regret of individual players.
arXiv Detail & Related papers (2021-07-13T18:37:37Z)
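As a reading aid, here is the usual form of the contextual-regret objective this summary refers to (our assumption of the standard online-learning definition; the paper's exact statement may differ), where r_i is player i's reward, a_t^{-i} the other players' actions, and z_t the context revealed at round t:

```latex
% Contextual regret of player i over T rounds: shortfall against the best
% fixed policy \pi mapping each revealed context z_t to an action.
% (Standard formulation; assumed here rather than quoted from the paper.)
R_i(T) = \max_{\pi \in \Pi} \sum_{t=1}^{T}
  \Bigl[ r_i\bigl(\pi(z_t),\, a_t^{-i},\, z_t\bigr)
       - r_i\bigl(a_t^{i},\, a_t^{-i},\, z_t\bigr) \Bigr]
```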