ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
- URL: http://arxiv.org/abs/2509.21826v1
- Date: Fri, 26 Sep 2025 03:38:27 GMT
- Title: ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
- Authors: Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Guojun Yin, Wei Lin, Ran He
- Abstract summary: Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. We propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%.
- Score: 62.82372407840088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and does not account for the particular structure of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.
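The entropy-informed token reweighting described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual weighting function, so everything below is an assumption made for illustration: the function names, the linear schedule in `progress`, and the rescaling of weights to mean 1. It only shows the general idea of favoring low-entropy (structural) tokens early in training and shifting weight toward high-entropy (reasoning) tokens later.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weights(dists: Sequence[Sequence[float]], progress: float) -> List[float]:
    """Hypothetical schedule (not taken from the paper).

    progress is the fraction of training completed, in [0, 1].
    At progress 0 low-entropy tokens dominate; at progress 1
    high-entropy tokens dominate.
    """
    if not dists:
        raise ValueError("dists must contain at least one token distribution")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    raw = []
    for probs in dists:
        max_entropy = math.log(len(probs))  # entropy of the uniform distribution
        h = token_entropy(probs) / max_entropy if max_entropy > 0 else 0.0
        raw.append((1.0 - progress) * (1.0 - h) + progress * h)
    mean = sum(raw) / len(raw)
    # Rescale to mean 1 so the overall gradient magnitude stays comparable.
    return [w / mean for w in raw] if mean > 0 else [1.0] * len(raw)


def reweighted_pg_loss(
    dists: Sequence[Sequence[float]],
    actions: Sequence[int],
    advantage: float,
    progress: float,
) -> float:
    """Token-weighted REINFORCE-style loss for one sampled response."""
    weights = entropy_weights(dists, progress)
    log_probs = [math.log(d[a]) for d, a in zip(dists, actions)]
    return -sum(w * advantage * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Toy response: one near-deterministic "structural" token, one uniform "reasoning" token.
dists = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
early = entropy_weights(dists, progress=0.0)
late = entropy_weights(dists, progress=1.0)
loss = reweighted_pg_loss(dists, actions=[0, 0], advantage=1.0, progress=0.5)
```

In this toy setup the early weights put all mass on the low-entropy token and the late weights favor the uniform one. At `progress=0.5` the weights are equal, so the loss reduces to the ordinary unweighted token-level policy-gradient loss.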
Related papers
- Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs [19.079556051442168]
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks. But for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly noisy. We propose a stabilization method for REINFORCE/GRPO-style algorithms that scales the learning rate based on effective sample size to dampen unreliable updates.
arXiv Detail & Related papers (2026-02-19T18:40:51Z)
- Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
- DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning [31.369103012768964]
DISPO is a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses. We show that DISPO achieves 61.04% on AIME'24 (vs. 55.42% for CISPO and 50.21% for DAPO), with similar gains across various benchmarks and models.
arXiv Detail & Related papers (2026-02-01T02:45:04Z)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
- Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning [48.34492357368989]
We propose a primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse. $R2VPO$ achieves superior performance with average relative gains of up to 17% over strong clipping-based baselines.
arXiv Detail & Related papers (2026-01-06T14:01:42Z)
- AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards [60.2998874976509]
We propose advantage-weighted policy optimization (AWPO) to integrate explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals. Experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks.
arXiv Detail & Related papers (2025-12-22T08:07:00Z)
- Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization [13.475938754147625]
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories. Two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence. We propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO).
arXiv Detail & Related papers (2025-12-08T11:59:25Z)
- Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning [49.57517969069136]
We introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. It consistently improves learning stability and performance across multiple benchmarks over strong baselines.
arXiv Detail & Related papers (2025-10-02T04:24:27Z)
- Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning [33.899779762210976]
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem. Existing methods mitigate this issue with KL penalties or clipping, which passively limit updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap before training.
arXiv Detail & Related papers (2025-09-18T17:02:30Z)
- Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling [35.97270347306353]
We propose EGPO, a new RL framework built upon Group Relative Policy Optimization (GRPO). The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size.
arXiv Detail & Related papers (2025-08-07T07:51:38Z)
- Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards [17.695285420477035]
We study the intermediate range of algorithms between off-policy RL and supervised fine-tuning. We first provide a theoretical analysis of this off-policy REINFORCE algorithm. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones.
arXiv Detail & Related papers (2025-06-25T15:07:16Z)
- BNPO: Beta Normalization Policy Optimization [9.60676665395923]
We propose a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated parameters. We provide theoretical analysis demonstrating BNPO's variance-reducing properties and show that it generalizes both REINFORCE and GRPO under binary-valued reward settings. Experimental results confirm that BNPO achieves state-of-the-art performance among policy optimization methods on reasoning tasks.
arXiv Detail & Related papers (2025-06-03T13:28:57Z)
- On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z)
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2× and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
- Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization [11.642505299142956]
Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. We show that performing on-policy reinforcement learning with an evidential critic provides both of these properties. We name the resulting algorithm Evidential Proximal Policy Optimization (EPPO) due to the integral role of evidential uncertainty in both policy evaluation and policy improvement stages.
arXiv Detail & Related papers (2025-03-03T12:23:07Z)
- Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization [10.465789490644031]
We propose a novel framework for robust regularized Markov decision process ($d$-RRMDP). For the offline RL setting, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI).
arXiv Detail & Related papers (2024-11-27T18:57:03Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from replay buffer and update policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.