PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
- URL: http://arxiv.org/abs/2508.00344v1
- Date: Fri, 01 Aug 2025 06:17:11 GMT
- Title: PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
- Authors: Keer Lu, Chong Chen, Bin Cui, Huang Leng, Wentao Zhang
- Abstract summary: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories. We introduce an adaptive global plan-based agent paradigm, AdaPlan, aiming to synergize high-level explicit guidance with execution.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm, AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing the closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.
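The abstract describes a progressive training schedule in three stages: first learning to follow an explicit global plan, then improving plan quality, and finally jointly optimizing planning and execution coordination. The Python sketch below illustrates one way such a staged reward schedule could be organized; the Episode fields, reward heuristics, stage lengths, and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a three-stage progressive RL schedule, assuming a generic
# rollout/update interface. All reward heuristics here are placeholders.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    plan: str              # global plan emitted by the agent
    actions: List[str]     # low-level actions executed under that plan
    task_success: float    # environment-provided outcome signal in [0, 1]


def plan_following_reward(ep: Episode) -> float:
    """Stage 1: reward execution that adheres to the global plan (placeholder heuristic)."""
    return float(len(ep.actions) > 0) * ep.task_success


def plan_quality_reward(ep: Episode) -> float:
    """Stage 2: reward plans that lead to successful executions (placeholder heuristic)."""
    return ep.task_success


def joint_reward(ep: Episode) -> float:
    """Stage 3: jointly credit planning and execution coordination (simple average here)."""
    return 0.5 * plan_following_reward(ep) + 0.5 * plan_quality_reward(ep)


def progressive_training(rollout: Callable[[], Episode],
                         update: Callable[[Episode, float], None],
                         steps_per_stage: int = 1000) -> None:
    """Run the three stages in sequence, switching the reward signal at each stage."""
    for reward_fn in (plan_following_reward, plan_quality_reward, joint_reward):
        for _ in range(steps_per_stage):
            ep = rollout()             # sample a plan + execution trajectory from the agent
            update(ep, reward_fn(ep))  # apply an RL update (e.g., policy gradient) under the stage reward
```

Keeping the same rollout/update loop while swapping the reward function between stages is what makes the curriculum progressive; in practice, each stage would apply an RL objective such as PPO to the LLM policy rather than the placeholder rewards shown here.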
Related papers
- Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards [50.21528417884747]
We introduce Omni-Thinker, a unified reinforcement learning framework that enhances the performance of large language models (LLMs) across diverse tasks. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging.
arXiv Detail & Related papers (2025-07-20T01:50:16Z) - PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization [58.465778756331574]
We propose a pseudocode-style Planning Guided Preference Optimization method called PGPO for effective agent learning. With two planning-oriented rewards, PGPO further enhances LLM agents' ability to generate high-quality P-code Plans. Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines.
arXiv Detail & Related papers (2025-06-02T09:35:07Z) - Reinforced Reasoning for Embodied Planning [18.40186665383579]
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. We introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning.
arXiv Detail & Related papers (2025-05-28T07:21:37Z) - SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks [110.20297293596005]
Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. Existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs. We propose a novel RL algorithm, SWEET-RL, that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms.
arXiv Detail & Related papers (2025-03-19T17:55:08Z) - MPO: Boosting LLM Agents with Meta Plan Optimization [37.35230659116656]
Large language models (LLMs) have enabled agents to successfully tackle interactive planning tasks. Existing approaches often suffer from planning hallucinations and require retraining for each new agent. We propose the Meta Plan Optimization framework, which enhances agent planning capabilities by directly incorporating explicit guidance.
arXiv Detail & Related papers (2025-03-04T14:54:45Z) - Retrieval-Augmented Hierarchical in-Context Reinforcement Learning and Hindsight Modular Reflections for Task Planning with LLMs [8.55917897789612]
We propose Retrieval-Augmented Hierarchical in-Context Reinforcement Learning (RAHL) for large language models.
RAHL decomposes complex tasks into sub-tasks using an LLM-based high-level policy.
We show that RAHL achieves performance improvements of 9%, 42%, and 10% over strong baselines within 5 episodes of execution.
arXiv Detail & Related papers (2024-08-12T22:40:01Z) - AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation [81.32722475387364]
Large Language Model-based agents have garnered significant attention and are becoming increasingly popular. Planning ability is a crucial component of an LLM-based agent, which generally entails achieving a desired goal from an initial state. Recent studies have demonstrated that utilizing expert-level trajectories for instruction-tuning LLMs effectively enhances their planning capabilities.
arXiv Detail & Related papers (2024-08-01T17:59:46Z) - Planning with Multi-Constraints via Collaborative Language Agents [13.550774629515843]
This paper introduces Planning with Multi-Constraints (PMC), a zero-shot methodology for collaborative multi-agent systems. PMC simplifies complex task planning with constraints by decomposing it into a hierarchy of subordinate tasks. PMC achieved an average 42.68% success rate on TravelPlanner, significantly higher than GPT-4 (2.92%), and outperformed GPT-4 with ReAct on API-Bank by 13.64%.
arXiv Detail & Related papers (2024-05-26T10:33:17Z) - Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning [56.82041895921434]
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities.
When used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2024-03-29T03:48:12Z) - Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous [66.6895109554163]
Underlying the human ability to align goals with other agents is their ability to predict the intentions of others and actively update their own plans.
We propose hierarchical predictive planning (HPP), a model-based reinforcement learning method for decentralized multiagent rendezvous.
arXiv Detail & Related papers (2020-03-15T19:49:20Z)