Related papers: Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents

URL: http://arxiv.org/abs/2509.03581v2
Date: Tue, 30 Sep 2025 09:12:45 GMT
Title: Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents
Authors: Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel
Abstract summary: Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. We introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives.
Score: 35.79575378215309
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. In agentic settings, existing methods like ReAct prompt LLMs to explicitly plan before every action; however, we demonstrate that always planning is computationally expensive and degrades performance on long-horizon tasks, while never planning further limits performance. To address this, we introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. We propose a simple two-stage training pipeline: (1) supervised fine-tuning on diverse synthetic data to prime models for dynamic planning, and (2) RL to refine this capability in long-horizon environments. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives. Additionally, we demonstrate that these agents can be effectively steered by human-written plans, surpassing their independent capabilities. To our knowledge, this work is the first to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks, paving the way for more efficient, adaptive, and controllable agentic systems.

Related papers

rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection [49.74493901036598]
Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs). This paper proposes a novel reinforced strategy injection mechanism (rSIM) that enables any LLM to become an RLM by employing a small planner. Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B.
arXiv Detail & Related papers (2025-12-09T06:55:39Z)
A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks [66.86312354478478]
Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. We introduce a plan-and-execute framework and propose a planner training method to enhance the executor agent's planning abilities without human effort. Experiments show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2025-10-07T06:10:53Z)
Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning [6.314485350935057]
Reinforcement Learning with Tool-use Rewards (RLTR) is a novel framework that decouples the training process to enable a focused, single-objective optimization of the planning module. Our experiments demonstrate that RLTR achieves an 8%-12% improvement in planning performance compared to end-to-end baselines. This enhanced planning capability, in turn, translates to a 5%-6% increase in the final response quality of the overall agent system.
arXiv Detail & Related papers (2025-08-27T06:19:50Z)
PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving [66.42260489147617]
We introduce PLAN-TUNING, a framework that distills synthetic task decompositions from large-scale language models. PLAN-TUNING fine-tunes smaller models via supervised and reinforcement-learning objectives to improve complex reasoning. Our analysis demonstrates how planning trajectories improve complex reasoning capabilities.
arXiv Detail & Related papers (2025-07-10T07:30:44Z)
Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL [62.984693936073974]
Large language models (LLMs) excel in tasks like question answering and dialogue. Complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. We propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents.
arXiv Detail & Related papers (2025-05-23T16:51:54Z)
Leveraging Pre-trained Large Language Models with Refined Prompting for Online Task and Motion Planning [24.797220935378057]
We present a closed-loop task planning and acting system, LLM-PAS, which is assisted by a pre-trained Large Language Model (LLM). We demonstrate the effectiveness and robustness of LLM-PAS in handling anomalous conditions during task execution.
arXiv Detail & Related papers (2025-04-30T12:53:53Z)
MPO: Boosting LLM Agents with Meta Plan Optimization [37.35230659116656]
Large language models (LLMs) have enabled agents to successfully tackle interactive planning tasks. Existing approaches often suffer from planning hallucinations and require retraining for each new agent. We propose the Meta Plan Optimization framework, which enhances agent planning capabilities by directly incorporating explicit guidance.
arXiv Detail & Related papers (2025-03-04T14:54:45Z)
Complex LLM Planning via Automated Heuristics Discovery [48.07520536415374]
We consider enhancing large language models (LLMs) for complex planning tasks. We propose automated heuristics discovery (AutoHD), a novel approach that enables LLMs to explicitly generate heuristic functions to guide inference-time search. Our proposed method requires no additional model training or fine-tuning, and the explicit definition of functions generated by the LLMs provides interpretability and insights into the reasoning process.
arXiv Detail & Related papers (2025-02-26T16:52:31Z)
Zero-shot Robotic Manipulation with Language-guided Instruction and Formal Task Planning [16.89900521727246]
We propose an innovative language-guided symbolic task planning (LM-SymOpt) framework with optimization. It is the first expert-free planning framework since we combine the world knowledge from Large Language Models with formal reasoning. Our experimental results show that LM-SymOpt outperforms existing LLM-based planning approaches.
arXiv Detail & Related papers (2025-01-25T13:33:22Z)
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation [81.32722475387364]
Large Language Model-based agents have garnered significant attention and are becoming increasingly popular. Planning ability is a crucial component of an LLM-based agent, which generally entails achieving a desired goal from an initial state. Recent studies have demonstrated that utilizing expert-level trajectory for instruction-tuning LLMs effectively enhances their planning capabilities.
arXiv Detail & Related papers (2024-08-01T17:59:46Z)
From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.