Related papers: Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

URL: http://arxiv.org/abs/2406.11012v5
Date: Mon, 15 Jul 2024 16:17:52 GMT
Title: Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
Authors: Prisha Samadarshi, Mariam Mustafa, Anushka Kulkarni, Raven Rothkopf, Tuhin Chakrabarty, Smaranda Muresan,
Abstract summary: We evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, can only fully solve 8% of the games.
Score: 20.64536059771047
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 200 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 8% of the games. Compared to GPT-4o, novice and expert players perform better, with expert human players significantly outperforming GPT-4o. To deepen our understanding we create a taxonomy of the knowledge types required to successfully categorize words in the Connections game, revealing that LLMs struggle with associative, encyclopedic, and linguistic knowledge. Our findings establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in humans and AI systems.

Related papers

Who is a Better Player: LLM against LLM [53.46608216197315]
We propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board games competition.<n>We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players.
arXiv Detail & Related papers (2025-08-05T06:41:47Z)
Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests [89.09172401497213]
We examine three evaluation paradigms: large question-answering benchmarks, interactive games, and cognitive tests. We compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models.
arXiv Detail & Related papers (2025-02-20T08:36:58Z)
Codenames as a Benchmark for Large Language Models [2.1028463367241033]
We use the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles.
arXiv Detail & Related papers (2024-12-16T01:59:03Z)
NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers [5.397565689903148]
We present NYT-Connections, a collection of 358 simple word classification puzzles derived from the New York Times Connections game. This benchmark is designed to penalize quick, intuitive "System 1" thinking, isolating fundamental reasoning skills.
arXiv Detail & Related papers (2024-12-02T15:41:47Z)
Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game [6.136654326170453]
The Connections puzzle is a word association game published daily by The New York Times (NYT) generating novel puzzles requires a form of metacognition: generators must be able to accurately model the downstream reasoning of potential solvers. Our findings show that LLMs are capable puzzle creators, and can generate diverse sets of enjoyable, challenging, and creative Connections puzzles as judged by human users.
arXiv Detail & Related papers (2024-07-15T21:05:25Z)
Language Models are Crossword Solvers [1.53744306569115]
We tackle the challenge of solving crosswords with Large Language Models (LLMs) We demonstrate that the current generation of state-of-the art (SoTA) language models show significant competence at deciphering cryptic crossword clues. We also develop a search algorithm that builds off this performance to tackle the problem of solving full crossword grids with LLMs.
arXiv Detail & Related papers (2024-06-13T12:29:27Z)
Missed Connections: Lateral Thinking Puzzles for Large Language Models [2.1374208474242815]
The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning.
arXiv Detail & Related papers (2024-04-17T20:31:05Z)
Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models [105.39236338147715]
The paper is inspired by the popular language game Who is Spy'' We develop DEEP to evaluate LLMs' expression and disguising abilities. We then introduce SpyGame, an interactive multi-agent framework.
arXiv Detail & Related papers (2023-10-31T14:37:42Z)
SPRING: Studying the Paper and Reasoning to Play Games [102.5587155284795]
We propose a novel approach, SPRING, to read the game's original academic paper and use the knowledge learned to reason and play the game through a large language model (LLM) In experiments, we study the quality of in-context "reasoning" induced by different forms of prompts under the setting of the Crafter open-world environment. Our experiments suggest that LLMs, when prompted with consistent chain-of-thought, have great potential in completing sophisticated high-level trajectories.
arXiv Detail & Related papers (2023-05-24T18:14:35Z)
JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions [75.42526766746515]
We propose a new commonsense reasoning dataset based on human's Interactive Fiction (IF) gameplay walkthroughs. Our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge. Experiments show that the introduced dataset is challenging to previous machine reading models as well as the new large language models.
arXiv Detail & Related papers (2022-10-18T19:20:53Z)
WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations. We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models. Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z)
Contextual Games: Multi-Agent Learning with Side Information [57.76996806603094]
We formulate the novel class of contextual games driven by contextual information at each round. By means of kernel-based regularity assumptions, we model the correlation between different contexts and game outcomes. We propose a novel online (meta) algorithm that exploits such correlations to minimize the contextual regret of individual players.
arXiv Detail & Related papers (2021-07-13T18:37:37Z)
Playing Codenames with Language Graphs and Word Embeddings [21.358501003335977]
We propose an algorithm that can generate Codenames clues from the language graph BabelNet. We introduce a new scoring function that measures the quality of clues. We develop BabelNet-Word Selection Framework (BabelNet-WSF) to improve BabelNet clue quality.
arXiv Detail & Related papers (2021-05-12T18:23:03Z)
Deep Reinforcement Learning with Stacked Hierarchical Attention for Text-based Games [64.11746320061965]
We study reinforcement learning for text-based games, which are interactive simulations in the context of natural language. We aim to conduct explicit reasoning with knowledge graphs for decision making, so that the actions of an agent are generated and supported by an interpretable inference procedure. We extensively evaluate our method on a number of man-made benchmark games, and the experimental results demonstrate that our method performs better than existing text-based agents.
arXiv Detail & Related papers (2020-10-22T12:40:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.