Missed Connections: Lateral Thinking Puzzles for Large Language Models
- URL: http://arxiv.org/abs/2404.11730v2
- Date: Sun, 21 Apr 2024 15:38:19 GMT
- Title: Missed Connections: Lateral Thinking Puzzles for Large Language Models
- Authors: Graham Todd, Tim Merino, Sam Earle, Julian Togelius
- Abstract summary: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme.
We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning.
- Score: 2.1374208474242815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e. definitions and typical usage) as well as, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern large language models (LLMs). We report their accuracy on the task, measure the impacts of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.
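The task format described in the abstract (sixteen words split into four groups of four) can be made concrete with a small scoring sketch. This is only an illustration under stated assumptions, not the authors' evaluation code: the function name `count_correct_groups`, the toy word lists, and the choice to score by exact group matches are all hypothetical.

```python
from typing import Sequence

GROUPS = 4       # groups per puzzle, as described in the abstract
GROUP_SIZE = 4   # words per group


def count_correct_groups(guess: Sequence[Sequence[str]],
                         solution: Sequence[Sequence[str]]) -> int:
    """Count guessed groups that exactly match a group in the solution."""
    def normalize(groups):
        # Compare case-insensitively and ignore word order inside a group.
        return [frozenset(w.strip().lower() for w in g) for g in groups]

    guess_sets, solution_sets = normalize(guess), normalize(solution)
    for name, sets in (("guess", guess_sets), ("solution", solution_sets)):
        words = frozenset().union(*sets)
        if (len(sets) != GROUPS
                or any(len(s) != GROUP_SIZE for s in sets)
                or len(words) != GROUPS * GROUP_SIZE):
            raise ValueError(f"{name} must be four groups of four distinct words")
    if frozenset().union(*guess_sets) != frozenset().union(*solution_sets):
        raise ValueError("guess and solution must use the same sixteen words")

    solution_lookup = set(solution_sets)
    return sum(1 for g in guess_sets if g in solution_lookup)


# Toy puzzle (hypothetical words, not taken from any real Connections puzzle).
solution = [
    ["apple", "pear", "plum", "fig"],
    ["red", "blue", "green", "pink"],
    ["oak", "elm", "pine", "ash"],
    ["cat", "dog", "cow", "pig"],
]
# A guess that swaps "fig" and "pink", breaking two of the four groups.
guess = [
    ["apple", "pear", "plum", "pink"],
    ["red", "blue", "green", "fig"],
    ["oak", "elm", "pine", "ash"],
    ["cat", "dog", "cow", "pig"],
]

print(count_correct_groups(guess, solution))  # 2
```

A single misplaced word spoils two groups at once, which is one reason exact-match scoring on this task is strict.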
Related papers
- Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game [6.136654326170453]
The Connections puzzle is a word association game published daily by The New York Times (NYT).
Generating novel puzzles requires a form of metacognition: generators must be able to accurately model the downstream reasoning of potential solvers.
Our findings show that LLMs are capable puzzle creators, and can generate diverse sets of enjoyable, challenging, and creative Connections puzzles as judged by human users.
arXiv Detail & Related papers (2024-07-15T21:05:25Z)
- Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game [20.64536059771047]
We evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players.
Our results show that even the best performing LLM, Claude 3.5 Sonnet, can only fully solve 18% of the games.
We create a taxonomy of the knowledge types required to successfully cluster and categorize words in the Connections game.
arXiv Detail & Related papers (2024-06-16T17:10:32Z)
- PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns [69.17409440805498]
We evaluate large multimodal models with abstract patterns based on fundamental concepts.
We find that they are not able to generalize well to simple abstract patterns.
Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
arXiv Detail & Related papers (2024-03-20T05:37:24Z)
- Retrieval-Generation Synergy Augmented Large Language Models [30.53260173572783]
We propose an iterative retrieval-generation collaborative framework.
We conduct experiments on four question answering datasets, including single-hop QA and multi-hop QA tasks.
arXiv Detail & Related papers (2023-10-08T12:50:57Z)
- BRAINTEASER: Lateral Thinking Puzzles for Large Language Models [15.95314613982879]
BRAINTEASER is a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking.
Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance.
We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.
arXiv Detail & Related papers (2023-10-08T07:46:01Z)
- Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models [52.31950122881687]
We introduce a new framework for language model inference, Tree of Thoughts (ToT).
ToT generalizes over the popular Chain of Thought approach to prompting language models.
Our experiments show that ToT significantly enhances language models' problem-solving abilities.
arXiv Detail & Related papers (2023-05-17T23:16:17Z)
- A Linguistic Investigation of Machine Learning based Contradiction Detection Models: An Empirical Analysis and Future Perspectives [0.34998703934432673]
We analyze two Natural Language Inference data sets with respect to their linguistic features.
The goal is to identify those syntactic and semantic properties that are particularly hard to comprehend for a machine learning model.
arXiv Detail & Related papers (2022-10-19T10:06:03Z)
- Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task.
Our method outperforms state-of-the-art counterparts on three public benchmarks.
arXiv Detail & Related papers (2022-07-20T19:49:32Z)
- KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)
- PuzzLing Machines: A Challenge on Learning From Small Data [64.513459448362]
We introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students.
Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages.
We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected.
arXiv Detail & Related papers (2020-04-27T20:34:26Z)