Related papers: Language Models are Crossword Solvers

URL: http://arxiv.org/abs/2406.09043v2
Date: Fri, 14 Jun 2024 21:29:40 GMT
Title: Language Models are Crossword Solvers
Authors: Soumadeep Saha, Sutanoya Chakraborty, Saptarshi Saha, Utpal Garain
Abstract summary: We tackle the challenge of solving crosswords with Large Language Models (LLMs). We demonstrate that the current generation of state-of-the-art (SoTA) language models show significant competence at deciphering cryptic crossword clues. We also develop a search algorithm that builds off this performance to tackle the problem of solving full crossword grids with LLMs.
Score: 1.53744306569115
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Crosswords are a form of word puzzle that require a solver to demonstrate a high degree of proficiency in natural language understanding, wordplay, reasoning, and world knowledge, along with adherence to character and length constraints. In this paper we tackle the challenge of solving crosswords with Large Language Models (LLMs). We demonstrate that the current generation of state-of-the-art (SoTA) language models show significant competence at deciphering cryptic crossword clues, and outperform previously reported SoTA results by a factor of 2-3 in relevant benchmarks. We also develop a search algorithm that builds off this performance to tackle the problem of solving full crossword grids with LLMs for the very first time, achieving an accuracy of 93% on New York Times crossword puzzles. Contrary to previous work in this area which concluded that LLMs lag human expert performance significantly, our research suggests this gap is a lot narrower.

Related papers

Logic-of-Thought: Empowering Large Language Models with Logic Programs for Solving Puzzles in Natural Language [67.51318974970985]
Solving puzzles in natural language poses a long-standing challenge in AI. We propose Logic-of-Thought, a framework that bridges large language models with logic programming. We evaluate our method on various grid puzzles and dynamic puzzles involving actions, demonstrating near-perfect accuracy across all tasks.
arXiv Detail & Related papers (2025-05-22T01:37:40Z)
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation [53.452699232071495]
CrossWordBench is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) through the medium of crossword puzzles. Our evaluation reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
arXiv Detail & Related papers (2025-03-30T20:03:36Z)
What Makes Cryptic Crosswords Challenging for LLMs? [4.463184061618504]
Cryptic crosswords are puzzles that rely on general knowledge and the solver's ability to manipulate language on different levels. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs).
arXiv Detail & Related papers (2024-12-12T07:23:52Z)
On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems. We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
Can large language models understand uncommon meanings of common words? [30.527834781076546]
Large language models (LLMs) have shown significant advancements across diverse natural language understanding (NLU) tasks. Yet, lacking widely acknowledged testing mechanisms, answering 'whether LLMs are parrots or genuinely comprehend the world' remains unclear. This paper presents innovative construction of a Lexical Semantic dataset with novel evaluation metrics.
arXiv Detail & Related papers (2024-05-09T12:58:22Z)
Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing less supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
Are LLMs Good Cryptic Crossword Solvers? [4.463184061618504]
Cryptic crosswords are puzzles that rely on the solver's ability to manipulate language on different levels and deal with various types of wordplay. Previous research suggests that solving such puzzles is a challenge even for modern NLP models.
arXiv Detail & Related papers (2024-03-15T06:57:08Z)
FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks. We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z)
The WebCrow French Crossword Solver [6.758790625418374]
We extend WebCrow, an automatic crossword solver, to French, making it the first program for crossword solving in the French language. To cope with the lack of a large repository of clue-answer crossword data, WebCrow exploits multiple modules, called experts, that retrieve candidate answers from heterogeneous resources. We compared WebCrow's performance against humans in two different challenges. Despite the limited amount of past crosswords, French WebCrow was competitive, actually outperforming humans in terms of speed and accuracy.
arXiv Detail & Related papers (2023-11-27T08:45:31Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Shortcut Learning of Large Language Models in Natural Language Understanding [119.45683008451698]
Large language models (LLMs) have achieved state-of-the-art performance on a series of natural language understanding tasks. They might rely on dataset bias and artifacts as shortcuts for prediction. This has significantly affected their generalizability and adversarial robustness.
arXiv Detail & Related papers (2022-08-25T03:51:39Z)
Down and Across: Introducing Crossword-Solving as a New NLP Benchmark [11.194615436370507]
We release the specification of a corpus of crossword puzzles collected from the New York Times daily crossword spanning 25 years. These puzzles include a diverse set of clues: historic, factual, word meaning, synonyms/antonyms, fill-in-the-blank, abbreviations, prefixes/suffixes, wordplay, and cross-lingual.
arXiv Detail & Related papers (2022-05-20T21:16:44Z)
Automated Crossword Solving [38.36920665368784]
Our system improves exact puzzle accuracy from 57% to 82% on crosswords from The New York Times. Our system also won first place at the top human crossword tournament.
arXiv Detail & Related papers (2022-05-19T16:28:44Z)
Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP [5.447716844779342]
Cryptic crosswords are the dominant English-language crossword variety in the United Kingdom. We present a dataset of cryptic crossword clues that can be used as a benchmark and train a sequence-to-sequence model to solve them. We show that performance can be substantially improved using a novel curriculum learning approach.
arXiv Detail & Related papers (2021-04-17T18:54:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.