Related papers: DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

URL: http://arxiv.org/abs/2601.18137v1
Date: Mon, 26 Jan 2026 04:43:49 GMT
Title: DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
Authors: Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, Junyang Lin,
Abstract summary: We introduce DeepPlanning, a benchmark for practical long-horizon agent planning.<n>It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization.
Score: 25.987776928014707
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.

Related papers

DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping [74.34061104176554]
We propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents.<n>Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts.
arXiv Detail & Related papers (2025-10-14T20:47:05Z)
ParaCook: On Time-Efficient Planning for Multi-Agent Systems [62.471032881396496]
Large Language Models (LLMs) exhibit strong reasoning abilities for planning long-horizon, real-world tasks.<n>We present ParaCook, a benchmark for time-efficient collaborative planning.
arXiv Detail & Related papers (2025-10-13T16:47:07Z)
Code Driven Planning with Domain-Adaptive Critic [41.04089289445378]
We propose Code Driven Planning with Domain-Adaptive Critic (CoPiC)<n>Instead of relying on frequent queries, CoPiC employs LLMs to generate a diverse set of high-level planning programs.<n>A trained domain-adaptive critic then evaluates these candidates and selects the one most aligned with long-term rewards for execution.
arXiv Detail & Related papers (2025-09-23T14:36:12Z)
Wide-Horizon Thinking and Simulation-Based Evaluation for Real-World LLM Planning with Multifaceted Constraints [39.01715254437105]
This paper introduces the Multiple Aspects of Planning (MAoP) to solve planning problems with multifaceted constraints.<n>Instead of direct planning, MAoP leverages the strategist to conduct pre-planning from various aspects and provide the planning blueprint for planners.
arXiv Detail & Related papers (2025-06-14T09:37:59Z)
PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization [58.465778756331574]
We propose a pseudocode-style Planning Guided Preference Optimization method called PGPO for effective agent learning.<n>With two planning-oriented rewards, PGPO further enhances LLM agents' ability to generate high-quality P-code Plans.<n>Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines.
arXiv Detail & Related papers (2025-06-02T09:35:07Z)
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks [36.63527489464188]
Plan-and-Act is a framework that incorporates explicit planning into large language models (LLMs)<n>Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions.<n>We present a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
arXiv Detail & Related papers (2025-03-12T17:40:52Z)
A Survey on Large Language Models for Automated Planning [15.767084100431115]
We critically investigate existing research on the use of Large Language Models in automated planning.<n>We illustrate that although LLMs are not well-suited to serve as standalone planners because of these limitations, they nonetheless present an enormous opportunity to enhance planning applications when combined with other approaches.
arXiv Detail & Related papers (2025-02-18T02:11:03Z)
Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows textcode-form plans -- pseudocode that outlines high-level, structured reasoning processes. CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks. It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z)
Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving planning capabilities of large language models (LLMs) We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning [65.86754998249224]
We develop a novel hybrid planner that leverages a conventional rule-based planner in conjunction with an LLM-based planner. Our approach navigates complex scenarios which existing planners struggle with, produces well-reasoned outputs while also remaining grounded through working alongside the rule-based approach.
arXiv Detail & Related papers (2023-12-30T02:53:45Z)
LLM-SAP: Large Language Models Situational Awareness Based Planning [0.0]
We employ a multi-agent reasoning framework to develop a methodology that anticipates and actively mitigates potential risks. Our approach diverges from traditional automata theory by incorporating the complexity of human-centric interactions into the planning process.
arXiv Detail & Related papers (2023-12-26T17:19:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.