Sudoku-Bench: Evaluating creative reasoning with Sudoku variants
- URL: http://arxiv.org/abs/2505.16135v1
- Date: Thu, 22 May 2025 02:24:35 GMT
- Title: Sudoku-Bench: Evaluating creative reasoning with Sudoku variants
- Authors: Jeffrey Seely, Yuki Imajuku, Tianyu Zhao, Edoardo Cetin, Llion Jones
- Abstract summary: Sudoku-Bench is a curated benchmark to evaluate creative, multi-step logical reasoning. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles.
- Score: 17.624558883326184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs (``break-ins''). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles -- making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15\% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.
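The abstract mentions a standardized text-based puzzle representation. The exact Sudoku-Bench format is not given here, so the following is only a hypothetical sketch of what a compact text grid and a basic consistency check might look like (the grid string, `units`, and `is_consistent` are illustrative names, not the benchmark's API):

```python
# Hypothetical sketch: a 9x9 Sudoku grid as an 81-character string,
# with "." marking empty cells, plus a row/column/box consistency check.

GRID = (
    "53..7...."
    "6..195..."
    ".98....6."
    "8...6...3"
    "4..8.3..1"
    "7...2...6"
    ".6....28."
    "...419..5"
    "....8..79"
)

def units(grid: str):
    """Yield the 27 units (9 rows, 9 columns, 9 boxes) of a 9x9 grid string."""
    rows = [grid[r * 9:(r + 1) * 9] for r in range(9)]
    for row in rows:                    # 9 rows
        yield row
    for c in range(9):                  # 9 columns
        yield "".join(row[c] for row in rows)
    for br in range(0, 9, 3):           # 9 boxes
        for bc in range(0, 9, 3):
            yield "".join(rows[br + dr][bc + dc]
                          for dr in range(3) for dc in range(3))

def is_consistent(grid: str) -> bool:
    """True if no digit repeats within any row, column, or 3x3 box."""
    for unit in units(grid):
        digits = [ch for ch in unit if ch != "."]
        if len(digits) != len(set(digits)):
            return False
    return True

print(is_consistent(GRID))  # the sample grid above has no conflicts
```

Sudoku variants would layer extra constraints (killer cages, thermometers, etc.) on top of such a base check; the compact shared structure is what makes uniform evaluation across variants possible.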
Related papers
- Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.497499123166804]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. We show that even state-of-the-art thinking models consistently fail on such problems and for similar reasons.
arXiv Detail & Related papers (2025-07-09T22:22:49Z)
- PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts [47.92619068073141]
We introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning.
arXiv Detail & Related papers (2025-06-06T16:17:09Z)
- Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint [48.35508965276618]
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles.
arXiv Detail & Related papers (2025-05-29T17:59:47Z)
- CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation [53.452699232071495]
CrossWordBench is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) through the medium of crossword puzzles. Our evaluation reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
arXiv Detail & Related papers (2025-03-30T20:03:36Z)
- EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [17.056693711040747]
We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events. This dataset probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning. The benchmark comprises 1184 puzzles of varying complexity requiring teams of skilled solvers hours to days to complete.
arXiv Detail & Related papers (2025-02-13T00:18:34Z)
- Mathematical Definition and Systematization of Puzzle Rules [0.0]
We introduce a mathematical framework for defining and systematizing pencil puzzle rules. This framework formalizes grid elements, their positional relationships, and iterative composition operations. Applying this framework, we successfully formalized the rules of well-known Nikoli puzzles, including Slitherlink and Sudoku.
arXiv Detail & Related papers (2024-12-18T02:00:53Z)
- On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems. We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z)
- Multi-Phase Relaxation Labeling for Square Jigsaw Puzzle Solving [73.58829980121767]
We present a novel method for solving square jigsaw puzzles based on global optimization.
The method is fully automatic, assumes no prior information, and can handle puzzles with known or unknown piece orientation.
arXiv Detail & Related papers (2023-03-26T18:53:51Z)
- Automated Graph Genetic Algorithm based Puzzle Validation for Faster Game Design [69.02688684221265]
This paper presents an evolutionary algorithm, informed by expert knowledge, for solving logical puzzles in video games efficiently.
We discuss multiple variations of hybrid genetic approaches for constraint satisfaction problems that allow us to find a diverse set of near-optimal solutions for puzzles.
arXiv Detail & Related papers (2023-02-17T18:15:33Z)
- Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles [67.39567701983357]
Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task.
Our method outperforms state-of-the-art counterparts on three public benchmarks.
arXiv Detail & Related papers (2022-07-20T19:49:32Z)
- Using Small MUSes to Explain How to Solve Pen and Paper Puzzles [4.535832029902474]
We present Demystify, a tool which allows puzzles to be expressed in a high-level constraint programming language.
We give several improvements to the existing techniques for solving puzzles with MUSes.
We demonstrate the effectiveness and generality of Demystify by comparing its results to documented strategies for solving a range of pen and paper puzzles by hand.
arXiv Detail & Related papers (2021-04-30T15:07:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.