Related papers: Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning

URL: http://arxiv.org/abs/2508.19598v1
Date: Wed, 27 Aug 2025 06:19:50 GMT
Title: Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning
Authors: Zhiwei Li, Yong Hu, Wenqing Wang
Abstract summary: Reinforcement Learning with Tool-use Rewards (RLTR) is a novel framework that decouples the training process to enable a focused, single-objective optimization of the planning module. Our experiments demonstrate that RLTR achieves an 8%-12% improvement in planning performance compared to end-to-end baselines. This enhanced planning capability, in turn, translates to a 5%-6% increase in the final response quality of the overall agent system.
Score: 6.314485350935057
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The functionality of Large Language Model (LLM) agents is primarily determined by two capabilities: action planning and answer summarization. The former, action planning, is the core capability that dictates an agent's performance. However, prevailing training paradigms employ end-to-end, multi-objective optimization that jointly trains both capabilities. This paradigm faces two critical challenges: imbalanced optimization objective allocation and scarcity of verifiable data, making it difficult to enhance the agent's planning capability. To address these challenges, we propose Reinforcement Learning with Tool-use Rewards (RLTR), a novel framework that decouples the training process to enable a focused, single-objective optimization of the planning module. Crucially, RLTR introduces a reward signal based on tool-use completeness to directly evaluate the quality of tool invocation sequences. This method offers a more direct and reliable training signal than assessing the final response content, thereby obviating the need for verifiable data. Our experiments demonstrate that RLTR achieves an 8%-12% improvement in planning performance compared to end-to-end baselines. Moreover, this enhanced planning capability, in turn, translates to a 5%-6% increase in the final response quality of the overall agent system.
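As a rough illustration of what a reward based on tool-use completeness could look like, the sketch below scores a planned tool-invocation sequence by how many of the tools a task needs were actually called. The abstract does not say how RLTR computes completeness, so the coverage metric, the function names `tool_use_completeness` and `planning_reward`, and the `threshold` parameter are all assumptions made for this example, not the authors' method.

```python
from typing import Sequence, Set


def tool_use_completeness(invoked: Sequence[str], required: Set[str]) -> float:
    """Fraction of required tools that appear in the invoked sequence.

    Hypothetical metric: the source abstract does not define completeness.
    """
    if not required:
        # Nothing is required, so any plan counts as complete.
        return 1.0
    covered = required.intersection(invoked)
    return len(covered) / len(required)


def planning_reward(
    invoked: Sequence[str], required: Set[str], threshold: float = 1.0
) -> float:
    """Binary reward: 1.0 if completeness reaches the threshold, else 0.0.

    The reward looks only at the tool calls, never at the final answer text.
    """
    return 1.0 if tool_use_completeness(invoked, required) >= threshold else 0.0


if __name__ == "__main__":
    required_tools = {"search", "calculator", "weather"}
    partial_plan = ["search", "calculator"]
    full_plan = ["search", "weather", "calculator"]
    print(tool_use_completeness(partial_plan, required_tools))
    print(planning_reward(partial_plan, required_tools))
    print(planning_reward(full_plan, required_tools))
```

Because the score depends only on the invocation sequence, a signal of this kind can be computed without a reference answer, which is the property the abstract highlights.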

Related papers

ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation [54.071574153853994]
ProRAG is a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism.
arXiv Detail & Related papers (2026-01-29T16:04:59Z)
Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7$\times$ compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z)
Test-driven Reinforcement Learning [1.1142354615369274]
We propose a Test-driven Reinforcement Learning (TdRL) framework to tackle the reward design challenge in RL. In TdRL, multiple test functions are used to represent the task objective rather than a single reward function. We show that TdRL matches or outperforms handcrafted reward methods in policy training.
arXiv Detail & Related papers (2025-11-11T06:58:52Z)
Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning. We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT. Exploration-friendly techniques, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy, are crucial for agentic RL and could improve training efficiency.
arXiv Detail & Related papers (2025-10-13T17:57:15Z)
A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks [66.86312354478478]
Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. We introduce a plan-and-execute framework and propose a planner training method to enhance the executor agent's planning abilities without human effort. Experiments show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2025-10-07T06:10:53Z)
Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents [35.79575378215309]
Training large language models (LLMs) to reason via reinforcement learning (RL) significantly improves their problem-solving capabilities. We introduce a conceptual framework formalizing dynamic planning for LLM agents, enabling them to flexibly decide when to allocate test-time compute for planning. Experiments on the Crafter environment show that dynamic planning agents trained with this approach are more sample-efficient and consistently achieve more complex objectives.
arXiv Detail & Related papers (2025-09-03T18:00:13Z)
PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning [36.051921179063264]
Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories. We introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution.
arXiv Detail & Related papers (2025-08-01T06:17:11Z)
Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards [50.21528417884747]
We introduce Omni-Thinker, a unified reinforcement learning framework that enhances large language model (LLM) performance across diverse tasks. Our approach enables consistent optimization across task types and scales RL-based training to subjective domains. Experimental results across four domains reveal that curriculum learning improves performance by 5.2% over joint training and 9.1% over model merging.
arXiv Detail & Related papers (2025-07-20T01:50:16Z)
Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks. Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use. We propose a framework that encourages models to produce accurate answers with minimal tool calls. Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
InstructRAG: Leveraging Retrieval-Augmented Generation on Instruction Graphs for LLM-Based Task Planning [6.75641900721385]
Large language models (LLMs) have enabled their use as agents for planning complex tasks. Retrieval-augmented generation (RAG) offers new opportunities by leveraging external databases to ground generation in retrieved information. We propose InstructRAG, a novel solution within a multi-agent meta-reinforcement learning framework to address these challenges.
arXiv Detail & Related papers (2025-04-17T15:41:39Z)
From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interactions between the agent and the environment. We propose a new method to address this, using unsupervised model-based RL to pre-train the agent. We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)
Efficient Reinforced Feature Selection via Early Stopping Traverse Strategy [36.890295071860166]
We propose a single-agent Monte Carlo based reinforced feature selection (MCRFS) method. We also propose two efficiency improvement strategies, i.e., early stopping (ES) strategy and reward-level interactive (RI) strategy.
arXiv Detail & Related papers (2021-09-29T03:51:13Z)
Reinforcement Learning for Robust Missile Autopilot Design [0.0]
This work is a pioneer in proposing Reinforcement Learning as a framework for flight control. Under TRPO's methodology, the collected experience is augmented according to HER, stored in a replay buffer, and sampled according to its significance. Results show that it is possible both to achieve optimal performance and to improve the agent's robustness to uncertainties.
arXiv Detail & Related papers (2020-11-26T09:30:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.