HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games
- URL: http://arxiv.org/abs/2510.12563v2
- Date: Wed, 15 Oct 2025 10:31:28 GMT
- Title: HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games
- Authors: Jingcong Liang, Shijun Wan, Xuehai Wu, Yitong Li, Qianglong Chen, Duyu Tang, Siyuan Wang, Zhongyu Wei
- Abstract summary: Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns. We introduce HardcoreLogic, a challenging benchmark of over 5,000 puzzles across 10 games.
- Score: 47.168515381473576
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games that require deriving solutions satisfying all constraints. However, whether they can flexibly apply appropriate rules to varying conditions, particularly when faced with non-canonical game variants, remains an open question. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns, which can mask deficiencies in understanding novel rules or adapting strategies to new variants. To address this, we introduce HardcoreLogic, a challenging benchmark of over 5,000 puzzles across 10 games, designed to test the robustness of LRMs on the "long-tail" of logical games. HardcoreLogic systematically transforms canonical puzzles through three dimensions: Increased Complexity (IC), Uncommon Elements (UE), and Unsolvable Puzzles (UP), reducing reliance on shortcut memorization. Evaluations on a diverse set of LRMs reveal significant performance drops, even for models achieving top scores on existing benchmarks, indicating heavy reliance on memorized stereotypes. While increased complexity is the dominant source of difficulty, models also struggle with subtle rule variations that do not necessarily increase puzzle difficulty. Our systematic error analysis on solvable and unsolvable puzzles further highlights gaps in genuine reasoning. Overall, HardcoreLogic exposes the limitations of current LRMs and establishes a benchmark for advancing high-level logical reasoning.
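Since the abstract reports performance broken down by transformation dimension (IC, UE, UP), a minimal sketch of how such per-dimension scoring could be organized is shown below. This is an illustrative assumption, not the authors' released evaluation code: the class names, fields, and aggregation function are invented for clarity.

```python
from dataclasses import dataclass
from enum import Enum
from collections import defaultdict

class Transformation(Enum):
    IC = "Increased Complexity"   # e.g. larger boards or extra constraints
    UE = "Uncommon Elements"      # rule variations absent from canonical puzzles
    UP = "Unsolvable Puzzle"      # instances with no valid solution

@dataclass
class Result:
    game: str                     # one of the 10 base games, e.g. "sudoku"
    transformation: Transformation
    correct: bool                 # whether the model's answer passed a game-specific verifier

def accuracy_by_transformation(results: list[Result]) -> dict[Transformation, float]:
    """Aggregate accuracy per transformation dimension to locate where models degrade."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r.transformation] += 1
        hits[r.transformation] += int(r.correct)
    return {t: hits[t] / totals[t] for t in totals}

# Tiny, made-up example run
demo = [
    Result("sudoku", Transformation.IC, False),
    Result("sudoku", Transformation.UE, True),
    Result("kakuro", Transformation.UP, False),
    Result("kakuro", Transformation.IC, True),
]
print(accuracy_by_transformation(demo))
```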
Related papers
- PHANTOM RECALL: When Familiar Puzzles Fool Smart Models [29.172155264798466]
Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles. Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on modified ones.
arXiv Detail & Related papers (2025-10-13T18:09:50Z) - PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles [53.47227295854126]
This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles.
arXiv Detail & Related papers (2025-10-07T21:24:29Z) - Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.497499123166804]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. We show that even state-of-the-art thinking models consistently fail on such problems, and for similar reasons.
arXiv Detail & Related papers (2025-07-09T22:22:49Z) - PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts [47.92619068073141]
We introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning.
arXiv Detail & Related papers (2025-06-06T16:17:09Z) - Sudoku-Bench: Evaluating creative reasoning with Sudoku variants [17.624558883326184]
Sudoku-Bench is a curated benchmark to evaluate creative, multi-step logical reasoning. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles.
arXiv Detail & Related papers (2025-05-22T02:24:35Z) - EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges [17.056693711040747]
We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events. This dataset probes models' ability to perform implicit knowledge synthesis and multi-step deductive reasoning. The benchmark comprises 1,184 puzzles of varying complexity that take teams of skilled solvers hours to days to complete.
arXiv Detail & Related papers (2025-02-13T00:18:34Z) - On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet can also make basic reasoning mistakes. One hypothesis is that their increasingly high and nearly saturated performance is due to memorization of similar problems. We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z) - Automated Graph Genetic Algorithm based Puzzle Validation for Faster Game Design [69.02688684221265]
This paper presents an evolutionary algorithm, empowered by expert-knowledge informed heuristics, for efficiently solving logical puzzles in video games.
We discuss multiple variations of hybrid genetic approaches for constraint satisfaction problems that allow us to find a diverse set of near-optimal solutions for puzzles.
arXiv Detail & Related papers (2023-02-17T18:15:33Z)
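The last entry above describes hybrid genetic approaches to puzzle constraint satisfaction only at a high level. As a purely generic illustration (not the paper's actual algorithm; the toy objective, operators, and parameters are all assumptions), a bare-bones genetic loop over a 4x4 Latin-square-style constraint objective might look like:

```python
import random

SIZE = 4  # toy 4x4 grid encoded as a flat list of 16 values in 1..SIZE

def fitness(genome):
    """Count satisfied 'all different' constraints over rows and columns (max = 8)."""
    rows = [genome[i * SIZE:(i + 1) * SIZE] for i in range(SIZE)]
    cols = [genome[i::SIZE] for i in range(SIZE)]
    return sum(len(set(group)) == SIZE for group in rows + cols)

def mutate(genome, rate=0.1):
    return [random.randint(1, SIZE) if random.random() < rate else v for v in genome]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=60, generations=300):
    pop = [[random.randint(1, SIZE) for _ in range(SIZE * SIZE)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)   # keep the fitter half as parents
        parents = pop[: pop_size // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print("constraints satisfied:", fitness(best))
```

The actual paper additionally uses graph-based representations and expert heuristics to find diverse near-optimal solutions; the sketch only conveys the overall evolutionary loop.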
This list is automatically generated from the titles and abstracts of the papers on this site.