Related papers: What Makes Cryptic Crosswords Challenging for LLMs?

URL: http://arxiv.org/abs/2412.09012v2
Date: Tue, 14 Jan 2025 06:06:54 GMT
Title: What Makes Cryptic Crosswords Challenging for LLMs?
Authors: Abdelrahman Sadallah, Daria Kotova, Ekaterina Kochmar
Abstract summary: Cryptic crosswords are puzzles that rely on general knowledge and the solver's ability to manipulate language on different levels. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs).
Score: 4.463184061618504
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cryptic crosswords are puzzles that rely on general knowledge and the solver's ability to manipulate language on different levels, dealing with various types of wordplay. Previous research suggests that solving such puzzles is challenging even for modern NLP models, including Large Language Models (LLMs). However, there is little to no research on the reasons for their poor performance on this task. In this paper, we establish the benchmark results for three popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance on this task is still significantly below that of humans. We also investigate why these models struggle to achieve superior performance. We release our code and introduced datasets at https://github.com/bodasadallah/decrypting-crosswords.

Related papers

Codenames as a Benchmark for Large Language Models [2.1028463367241033]
We use the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles.
arXiv Detail & Related papers (2024-12-16T01:59:03Z)
CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. This raises the question: To what extent can LLMs learn orthographic information? We propose a new benchmark, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z)
Language Models are Crossword Solvers [1.53744306569115]
We tackle the challenge of solving crosswords with Large Language Models (LLMs). We demonstrate that the current generation of state-of-the-art (SoTA) language models show significant competence at deciphering cryptic crossword clues. We also develop a search algorithm that builds off this performance to tackle the problem of solving full crossword grids with LLMs.
arXiv Detail & Related papers (2024-06-13T12:29:27Z)
Are LLMs Good Cryptic Crossword Solvers? [4.463184061618504]
Cryptic crosswords are puzzles that rely on the solver's ability to manipulate language on different levels and deal with various types of wordplay. Previous research suggests that solving such puzzles is a challenge even for modern NLP models.
arXiv Detail & Related papers (2024-03-15T06:57:08Z)
The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency. We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes. It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES. Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query. By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval.
arXiv Detail & Related papers (2023-05-24T06:16:44Z)
Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT). We present a novel method, CoD, which augments LLMs with prior knowledge in the form of chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
arXiv Detail & Related papers (2023-05-11T05:19:47Z)
Can Large Language Models Transform Computational Social Science? [79.62471267510963]
Large Language Models (LLMs) are capable of performing many language processing tasks zero-shot (without training data). This work provides a road map for using LLMs as Computational Social Science tools.
arXiv Detail & Related papers (2023-04-12T17:33:28Z)
True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4 [0.0]
Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities. In this paper, we introduce a benchmark consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles. We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy) while state-of-the-art GPT-4 solves only 38% of puzzles.
arXiv Detail & Related papers (2022-12-20T09:34:43Z)
PuzzLing Machines: A Challenge on Learning From Small Data [64.513459448362]
We introduce a challenge on learning from small data, PuzzLing Machines, which consists of Rosetta Stone puzzles from Linguistic Olympiads for high school students. Our challenge contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages. We show that both simple statistical algorithms and state-of-the-art deep neural models perform inadequately on this challenge, as expected.
arXiv Detail & Related papers (2020-04-27T20:34:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.