SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2601.20856v1
- Date: Wed, 28 Jan 2026 18:56:00 GMT
- Title: SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
- Authors: Sebastiano Monti, Carlo Nicolini, Gianni Pellegrini, Jacopo Staiano, Bruno Lepri,
- Abstract summary: We provide a systematic assessment of the planning and long-horizon reasoning capabilities of large language models.<n>We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence.<n>Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution.
- Score: 6.718614728570862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although the capabilities of large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools allows for modest improvements, suggesting inherent architectural limitations which might not be overcome by test-time scaling approaches alone.
Related papers
- Laplacian Representations for Decision-Time Planning [20.25004555858261]
We show that the Laplacian representation provides an effective latent space for planning by capturing state-space distances at multiple time scales.<n>This representation preserves meaningful distances and naturally decomposes long-horizon problems into subgoals, also mitigating the compounding errors that arise over long prediction horizons.<n>We introduce ALPS, a hierarchical planning algorithm, and demonstrate that it outperforms commonly used baselines on a selection of offline goal-conditioned RL tasks from OGBench.
arXiv Detail & Related papers (2026-02-04T20:34:50Z) - Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents [42.09897801169138]
Large language model (LLM)-based agents exhibit strong step-by-step reasoning capabilities over short horizons, yet often fail to sustain coherent behavior over long planning horizons.<n>We argue that step-wise reasoning induces a form of step-wise greedy policy that is adequate for short horizons but fails in long-horizon planning.<n>We introduce FLARE as a minimal instantiation of future-aware planning to enforce explicit lookahead, value propagation, and limited commitment in a single model.
arXiv Detail & Related papers (2026-01-29T20:52:32Z) - PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models [51.43746425777865]
Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks.<n>We propose PILOT, a framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance.
arXiv Detail & Related papers (2026-01-07T12:38:56Z) - DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping [74.34061104176554]
We propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents.<n>Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts.
arXiv Detail & Related papers (2025-10-14T20:47:05Z) - Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study [0.0]
Large Language Models have sparked interest in their potential for robotic task planning.<n>While these models demonstrate strong generative capabilities, their effectiveness in producing structured and executable plans remains uncertain.<n>This paper presents a systematic evaluation of a broad spectrum of current state of the art language models.
arXiv Detail & Related papers (2025-07-31T14:25:54Z) - HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking [109.09735490692202]
We propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning.<n> Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6 times performance improvement over o1-preview.
arXiv Detail & Related papers (2025-05-05T02:38:58Z) - LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models [18.66263636077183]
Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems.<n>This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities.
arXiv Detail & Related papers (2025-03-22T03:35:44Z) - A Survey on Large Language Models for Automated Planning [15.767084100431115]
We critically investigate existing research on the use of Large Language Models in automated planning.<n>We illustrate that although LLMs are not well-suited to serve as standalone planners because of these limitations, they nonetheless present an enormous opportunity to enhance planning applications when combined with other approaches.
arXiv Detail & Related papers (2025-02-18T02:11:03Z) - Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows textcode-form plans -- pseudocode that outlines high-level, structured reasoning processes.
CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks.
It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z) - Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving planning capabilities of large language models (LLMs)
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories.
Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z) - LLM-SAP: Large Language Models Situational Awareness Based Planning [0.0]
We employ a multi-agent reasoning framework to develop a methodology that anticipates and actively mitigates potential risks.
Our approach diverges from traditional automata theory by incorporating the complexity of human-centric interactions into the planning process.
arXiv Detail & Related papers (2023-12-26T17:19:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.