Missed Connections: Lateral Thinking Puzzles for Large Language Models
- URL: http://arxiv.org/abs/2404.11730v2
- Date: Sun, 21 Apr 2024 15:38:19 GMT
- Title: Missed Connections: Lateral Thinking Puzzles for Large Language Models
- Authors: Graham Todd, Tim Merino, Sam Earle, Julian Togelius
- Abstract summary: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme.
We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning.
- Score: 2.1374208474242815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e., definitions and typical usage) and, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern large language models (LLMs). We report their accuracy on the task, measure the impacts of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.
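The task format described in the abstract (sixteen words, four groups of four) can be illustrated with a minimal sketch. The word bank below is invented for illustration and is not taken from the New York Times or from the paper, and the function name `score_partition` and its scoring rule (counting exactly matched groups) are assumptions, not the authors' evaluation code.

```python
from typing import FrozenSet, Iterable, List

GROUPS = 4       # number of categories in a puzzle
GROUP_SIZE = 4   # words per category


def _normalize(groups: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Upper-case every word and turn each group into a frozenset."""
    return [frozenset(word.upper() for word in group) for group in groups]


def _check_shape(name: str, groups: List[FrozenSet[str]]) -> None:
    """Ensure a partition has four disjoint groups of four words."""
    if len(groups) != GROUPS or any(len(g) != GROUP_SIZE for g in groups):
        raise ValueError(f"{name} must have {GROUPS} groups of {GROUP_SIZE} words")
    if len(frozenset().union(*groups)) != GROUPS * GROUP_SIZE:
        raise ValueError(f"{name} reuses a word across groups")


def score_partition(solution, guess) -> int:
    """Return how many guessed groups exactly match a solution group."""
    sol = _normalize(solution)
    gs = _normalize(guess)
    _check_shape("solution", sol)
    _check_shape("guess", gs)
    if frozenset().union(*sol) != frozenset().union(*gs):
        raise ValueError("guess must use exactly the words in the puzzle")
    solution_groups = set(sol)
    return sum(1 for group in gs if group in solution_groups)


# Hypothetical puzzle (invented): BASS and DRUM each fit two themes,
# which is the kind of overlap that makes the grouping non-trivial.
SOLUTION = [
    ["BASS", "TROUT", "SALMON", "PIKE"],   # fish
    ["DRUM", "HARP", "FLUTE", "ORGAN"],    # instruments
    ["FLY", "WORK", "PLACE", "TRUCK"],     # words that follow "FIRE"
    ["RED", "BLUE", "GREEN", "PINK"],      # colors
]

# A plausible but wrong guess that swaps BASS and DRUM.
GUESS = [
    ["DRUM", "TROUT", "SALMON", "PIKE"],
    ["BASS", "HARP", "FLUTE", "ORGAN"],
    ["FLY", "WORK", "PLACE", "TRUCK"],
    ["RED", "BLUE", "GREEN", "PINK"],
]

if __name__ == "__main__":
    print(score_partition(SOLUTION, GUESS))
```

Counting exactly matched groups is only one possible metric; a benchmark could instead report whether the whole puzzle was solved or allow a limited number of mistakes, as the game itself does.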
Related papers
- EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [17.056693711040747]
We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events.
This dataset probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning.
The benchmark comprises 1184 puzzles of varying complexity requiring teams of skilled solvers hours to days to complete.
arXiv Detail & Related papers (2025-02-13T00:18:34Z) - Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game [6.136654326170453]
The Connections puzzle is a word association game published daily by The New York Times (NYT).
Generating novel puzzles requires a form of metacognition: generators must be able to accurately model the downstream reasoning of potential solvers.
Our findings show that LLMs are capable puzzle creators, and can generate diverse sets of enjoyable, challenging, and creative Connections puzzles as judged by human users.
arXiv Detail & Related papers (2024-07-15T21:05:25Z) - Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game [20.64536059771047]
We evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players.
Our results show that even the best performing LLM, Claude 3.5 Sonnet, can only fully solve 18% of the games.
We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game.
arXiv Detail & Related papers (2024-06-16T17:10:32Z) - Language Models are Crossword Solvers [1.53744306569115]
We tackle the challenge of solving crosswords with large language models (LLMs).
We demonstrate that the current generation of language models shows significant competence at deciphering cryptic crossword clues.
We also develop a search algorithm that builds off this performance to tackle the problem of solving full crossword grids with out-of-the-box LLMs.
arXiv Detail & Related papers (2024-06-13T12:29:27Z) - PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [69.17409440805498]
We evaluate large multimodal models with abstract patterns based on fundamental concepts.
We find that they are not able to generalize well to simple abstract patterns.
Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z) - BRAINTEASER: Lateral Thinking Puzzles for Large Language Models [15.95314613982879]
BRAINTEASER is a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking.
Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance.
We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
arXiv Detail & Related papers (2023-10-08T07:46:01Z) - Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z) - Tree of Thoughts: Deliberate Problem Solving with Large Language Models [52.31950122881687]
We introduce a new framework for language model inference, Tree of Thoughts (ToT).
ToT generalizes over the popular Chain of Thought approach to prompting language models.
Our experiments show that ToT significantly enhances language models' problem-solving abilities.
arXiv Detail & Related papers (2023-05-17T23:16:17Z) - Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task.
Our method outperforms state-of-the-art counterparts on three public benchmarks.
arXiv Detail & Related papers (2022-07-20T19:49:32Z) - KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z) - PuzzLing Machines: A Challenge on Learning From Small Data [64.513459448362]
We introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students.
Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages.
We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected.
arXiv Detail & Related papers (2020-04-27T20:34:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.