PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
- URL: http://arxiv.org/abs/2510.06475v1
- Date: Tue, 07 Oct 2025 21:24:29 GMT
- Title: PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
- Authors: Yitao Long, Yuru Jiang, Hongjun Liu, Yilun Zhao, Jingchen Sun, Yiqiu Shen, Chen Zhao, Arman Cohan, Dennis Shasha
- Abstract summary: This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles.
- Score: 53.47227295854126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.
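The abstract contrasts an instruction-based setting (the model is prompted for a move every turn) with a code-based setting (the model writes a program once, which then plays on its own). A minimal sketch of that distinction, with hypothetical function names and prompts that are my own illustration, not the PuzzlePlex framework's actual interface:

```python
# Hypothetical sketch of the two evaluation settings; all names and prompts
# are illustrative assumptions, not PuzzlePlex's actual API.

def solve_instruction(model, puzzle_state: str) -> str:
    """Instruction-based: the model is queried for a move at every turn."""
    prompt = f"Current puzzle state:\n{puzzle_state}\nReply with only your next move."
    return model(prompt)

def solve_code(model, puzzle_rules: str) -> str:
    """Code-based: the model writes a program once; the framework then
    executes that program for every turn, a cheaper, more scalable loop."""
    prompt = (f"Puzzle rules:\n{puzzle_rules}\n"
              "Write a Python function next_move(state) that returns a move.")
    return model(prompt)

if __name__ == "__main__":
    # Stub "model" that echoes the last line of its prompt, for demonstration.
    echo = lambda p: p.splitlines()[-1]
    print(solve_instruction(echo, "X _ O"))
```

The trade-off the abstract reports falls out of the shapes above: the instruction-based loop costs one model call per turn, while the code-based setting costs one call per game but demands that the generated program be correct.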
Related papers
- Solving Convex Partition Visual Jigsaw Puzzles [3.0427549266235125]
Jigsaw puzzle solving requires rearrangement of unordered pieces into their original pose in order to reconstruct a coherent whole. Most of the literature has focused on developing solvers for square jigsaw puzzles, severely limiting their practical use. In this work, we significantly expand the types of puzzles handled computationally, focusing on what is known as Convex Partitions.
arXiv Detail & Related papers (2025-11-06T15:22:46Z) - HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games [47.168515381473576]
Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns. We introduce HardcoreLogic, a challenging benchmark of over 5,000 puzzles across 10 games.
arXiv Detail & Related papers (2025-10-14T14:23:24Z) - GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games [8.640618631999173]
We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model's ability to handle tasks that differ from most existing LLM benchmarks.
arXiv Detail & Related papers (2025-08-11T22:17:07Z) - PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts [47.92619068073141]
We introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning.
arXiv Detail & Related papers (2025-06-06T16:17:09Z) - PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving [50.50405233978406]
We propose a fully dynamic multimodal evaluation framework, named Open-ended Visual Puzzle Generation (OVPG). OVPG aims to generate fresh, diverse, and verifiable evaluation data automatically in puzzle-solving tasks. Built upon OVPG, we construct PuzzleBench, a dynamic and scalable benchmark comprising 11,840 VQA samples.
arXiv Detail & Related papers (2025-04-15T05:29:31Z) - CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation [53.452699232071495]
We introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) through the medium of crossword puzzles. Our evaluation reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints. Our findings highlight limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
arXiv Detail & Related papers (2025-03-30T20:03:36Z) - Mathematical Definition and Systematization of Puzzle Rules [0.0]
We introduce a mathematical framework for defining and systematizing pencil puzzle rules. This framework formalizes grid elements, their positional relationships, and iterative composition operations. Applying this framework, we successfully formalized the rules of well-known Nikoli puzzles, including Slitherlink and Sudoku.
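The idea of composing a rule over grid regions can be illustrated with a small sketch. This is my own example in code rather than the paper's formalism: the "all values in a region are pairwise distinct" rule, composed over the rows, columns, and 3x3 boxes of a Sudoku grid.

```python
# Illustrative sketch of rule composition over grid regions; not taken
# from the paper's framework.
from itertools import product

def all_different(cells):
    """Core pencil-puzzle rule: filled values in a region are pairwise distinct.
    Empty cells (None) are ignored, so partial grids can satisfy the rule."""
    vals = [v for v in cells if v is not None]
    return len(vals) == len(set(vals))

def sudoku_ok(grid):
    """Compose the rule over the rows, columns, and 3x3 boxes of a 9x9 grid."""
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[3 * br + r][3 * bc + c]
              for r, c in product(range(3), repeat=2)]
             for br, bc in product(range(3), repeat=2)]
    return all(all_different(region) for region in rows + cols + boxes)
```

Other Nikoli-style rules slot into the same shape: define a local predicate over a region, then compose it over the positional structure of the grid.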
arXiv Detail & Related papers (2024-12-18T02:00:53Z) - PUZZLES: A Benchmark for Neural Algorithmic Reasoning [21.57943896942296]
We introduce PUZZLES, a benchmark based on Simon Tatham's Portable Puzzle Collection.
PUZZLES contains 40 diverse logic puzzles of adjustable sizes and varying levels of complexity.
The puzzles provide detailed information on the strengths and generalization capabilities of RL agents.
arXiv Detail & Related papers (2024-06-29T11:02:05Z) - Automated Graph Genetic Algorithm based Puzzle Validation for Faster Game Design [69.02688684221265]
This paper presents an evolutionary algorithm, informed by expert knowledge, for solving logical puzzles in video games efficiently.
We discuss multiple variations of hybrid genetic approaches for constraint satisfaction problems that allow us to find a diverse set of near-optimal solutions for puzzles.
arXiv Detail & Related papers (2023-02-17T18:15:33Z) - Portfolio Search and Optimization for General Strategy Game-Playing [58.896302717975445]
We propose a new algorithm for optimization and action-selection based on the Rolling Horizon Evolutionary Algorithm.
For the optimization of the agents' parameters and portfolio sets we study the use of the N-tuple Bandit Evolutionary Algorithm.
An analysis of the agents' performance shows that the proposed algorithm generalizes well to all game-modes and is able to outperform other portfolio methods.
arXiv Detail & Related papers (2021-04-21T09:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.