DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping
- URL: http://arxiv.org/abs/2510.12979v1
- Date: Tue, 14 Oct 2025 20:47:05 GMT
- Title: DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping
- Authors: Wei Fan, Wenlin Yao, Zheng Li, Feng Yao, Xin Liu, Liang Qiu, Qingyu Yin, Yangqiu Song, Bing Yin
- Abstract summary: We propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts.
- Score: 74.34061104176554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) augmented with multi-step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long-horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. To address this, we propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.
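The two mechanisms named in the abstract (an entropy-based term added to token-level advantages, and a selective upweighting of sample-level advantages for planning-intensive rollouts) can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration: the additive clipped form of the entropy bonus, the hyperparameters `alpha`, `clip_ratio`, and `plan_weight`, the boolean `is_planning_heavy` flag, and the function names themselves. It is not the authors' implementation.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each token position.

    logits: array of shape (..., vocab_size).
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    p = np.exp(log_p)
    return -(p * log_p).sum(axis=-1)


def shape_advantages(advantages, entropies, is_planning_heavy,
                     alpha=0.1, clip_ratio=0.5, plan_weight=1.5):
    """Hypothetical advantage shaping, assumed for illustration only.

    advantages:        (batch, tokens) token-level advantages.
    entropies:         (batch, tokens) per-token policy entropy.
    is_planning_heavy: (batch,) bool flag marking planning-intensive rollouts
                       (how such rollouts are detected is not specified here).
    """
    # Assumed form: add an entropy-proportional bonus, clipped to a fraction
    # of |advantage| so the bonus can never flip the advantage's sign.
    bonus = np.minimum(alpha * entropies, clip_ratio * np.abs(advantages))
    shaped = advantages + bonus
    # Assumed form: scale every token of a flagged rollout by a constant weight.
    weights = np.where(is_planning_heavy, plan_weight, 1.0)
    return shaped * weights[:, None]


if __name__ == "__main__":
    adv = np.array([[1.0, -1.0], [1.0, -1.0]])
    ent = np.array([[2.0, 0.0], [2.0, 0.0]])
    flags = np.array([True, False])
    print(shape_advantages(adv, ent, flags))
```

In this sketch the clipping keeps the entropy term from dominating the underlying advantage signal; a training loop would also have to treat the entropy as a constant (no gradient) when computing the policy loss, which plain NumPy does implicitly.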
Related papers
- DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints [25.987776928014707]
We introduce DeepPlanning, a benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization.
arXiv Detail & Related papers (2026-01-26T04:43:49Z) - PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization [58.465778756331574]
We propose a pseudocode-style Planning Guided Preference Optimization method called PGPO for effective agent learning. With two planning-oriented rewards, PGPO further enhances LLM agents' ability to generate high-quality P-code Plans. Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines.
arXiv Detail & Related papers (2025-06-02T09:35:07Z) - HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking [109.09735490692202]
We propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6 times performance improvement over o1-preview.
arXiv Detail & Related papers (2025-05-05T02:38:58Z) - Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks [36.63527489464188]
Plan-and-Act is a framework that incorporates explicit planning into large language models (LLMs). Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. We present a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
arXiv Detail & Related papers (2025-03-12T17:40:52Z) - DHP: Discrete Hierarchical Planning for Hierarchical Reinforcement Learning Agents [2.1438108757511958]
We propose a method that replaces continuous distance estimates with discrete reachability checks to evaluate subgoal feasibility. Experiments in 25-room navigation environments demonstrate a 100% success rate. The method also generalizes to momentum-based control tasks and requires only $\log N$ steps for replanning.
arXiv Detail & Related papers (2025-02-04T03:05:55Z) - Learning to Plan with Personalized Preferences [16.65506804881317]
We introduce the Preference-based Planning (PbP) benchmark, an embodied benchmark featuring hundreds of diverse preferences spanning from atomic actions to complex sequences. Our evaluation of SOTA methods reveals that while symbol-based approaches show promise in scalability, significant challenges remain in learning to generate and execute plans that satisfy personalized preferences. These findings establish preferences as a valuable abstraction layer for adaptive planning, opening new directions for research in preference-guided plan generation and execution.
arXiv Detail & Related papers (2025-02-02T17:16:25Z) - Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - Learning Logic Specifications for Policy Guidance in POMDPs: an Inductive Logic Programming Approach [57.788675205519986]
We learn high-quality traces from POMDP executions generated by any solver.
We exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications.
We show that learned specifications expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics within lower computational time.
arXiv Detail & Related papers (2024-02-29T15:36:01Z) - LLM-SAP: Large Language Models Situational Awareness Based Planning [0.0]
We employ a multi-agent reasoning framework to develop a methodology that anticipates and actively mitigates potential risks.
Our approach diverges from traditional automata theory by incorporating the complexity of human-centric interactions into the planning process.
arXiv Detail & Related papers (2023-12-26T17:19:09Z) - Probabilistic contingent planning based on HTN for high-quality plans [8.23558342809427]
We propose a contingent Hierarchical Task Network (HTN) planner, named High-Quality Contingent Planner (HQCP).
HQCP generates high-quality plans in the partially observable environment.
The formalisms in HTN planning are extended into partial observability and are evaluated regarding the cost.
arXiv Detail & Related papers (2023-08-14T03:55:14Z) - Hierarchical Imitation Learning with Vector Quantized Models [77.67190661002691]
We propose to use reinforcement learning to identify subgoals in expert trajectories.
We build a vector-quantized generative model for the identified subgoals to perform subgoal-level planning.
In experiments, the algorithm excels at solving complex, long-horizon decision-making problems, outperforming the state of the art.
arXiv Detail & Related papers (2023-01-30T15:04:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.