Related papers: ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models

URL: http://arxiv.org/abs/2509.21826v1
Date: Fri, 26 Sep 2025 03:38:27 GMT
Title: ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
Authors: Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Guojun Yin, Wei Lin, Ran He
Abstract summary: Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. We propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%.
Score: 62.82372407840088
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.
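
Illustrative sketch (not the authors' implementation): the abstract describes entropy-informed token reweighting that shifts emphasis from structured, low-entropy tokens toward reasoning tokens as training proceeds. The Python snippet below shows one way such a scheme could look. The linear schedule, the `strength` parameter, the `progress` argument, and all function names are assumptions made for illustration; the abstract does not specify the actual weighting rule.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def token_entropy(logits):
    # Per-token policy entropy; logits has shape (num_tokens, vocab_size).
    logp = log_softmax(logits)
    return -(np.exp(logp) * logp).sum(axis=-1)

def entropy_weights(entropy, progress, strength=0.5):
    # Hypothetical schedule: progress in [0, 1] is the fraction of training done.
    # Early (progress near 0) low-entropy tokens get weights above 1;
    # late (progress near 1) high-entropy tokens get weights above 1.
    # The weights average to 1 by construction.
    h = entropy / (entropy.max() + 1e-8)  # normalize entropies to [0, 1]
    return 1.0 + strength * (2.0 * progress - 1.0) * (h - h.mean())

def reweighted_pg_loss(logits, actions, advantage, progress):
    # Token-level policy-gradient surrogate with entropy-based weights.
    # In an autodiff framework the weights would be treated as constants.
    logp = log_softmax(logits)
    chosen = logp[np.arange(len(actions)), actions]
    w = entropy_weights(token_entropy(logits), progress)
    return -(w * advantage * chosen).mean()
```

Under these assumptions, calling `entropy_weights` with `progress=0.0` favors confident (structural) tokens, while `progress=1.0` favors uncertain (reasoning-like) tokens, and `progress=0.5` reduces to an unweighted policy gradient.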

Related papers

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs [19.079556051442168]
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks. But for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly noisy. We propose a stabilization method for REINFORCE/GRPO-style algorithms that scales learning rate based on effective sample size to dampen unreliable updates.
arXiv Detail & Related papers (2026-02-19T18:40:51Z)
Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning [31.369103012768964]
DISPO is a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses. We show that DISPO achieves 61.04% on AIME'24 (vs. 55.42% for CISPO and 50.21% for DAPO) with similar gains across various benchmarks and models.
arXiv Detail & Related papers (2026-02-01T02:45:04Z)
A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning [48.34492357368989]
We propose a primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse. R2VPO achieves superior performance with average relative gains of up to 17% over strong clipping-based baselines.
arXiv Detail & Related papers (2026-01-06T14:01:42Z)
AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards [60.2998874976509]
We propose advantage-weighted policy optimization (AWPO) to integrate explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals. Experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks.
arXiv Detail & Related papers (2025-12-22T08:07:00Z)
Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization [13.475938754147625]
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories. Two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence. We propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO).
arXiv Detail & Related papers (2025-12-08T11:59:25Z)
Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning [49.57517969069136]
We introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. It consistently improves learning stability and performance across multiple benchmarks over strong baselines.
arXiv Detail & Related papers (2025-10-02T04:24:27Z)
Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning [33.899779762210976]
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem. Existing methods mitigate this issue with KL penalties or clipping, which passively updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap before training.
arXiv Detail & Related papers (2025-09-18T17:02:30Z)
Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling [35.97270347306353]
We propose EGPO, a new RL framework built upon Group Relative Policy Optimization (GRPO). The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size.
arXiv Detail & Related papers (2025-08-07T07:51:38Z)
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards [17.695285420477035]
We study the intermediate range of algorithms between off-policy RL and supervised fine-tuning. We first provide a theoretical analysis of this off-policy REINFORCE algorithm. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones.
arXiv Detail & Related papers (2025-06-25T15:07:16Z)
BNPO: Beta Normalization Policy Optimization [9.60676665395923]
We propose a novel policy optimization method that adaptively normalizes rewards using a Beta distribution with dynamically updated parameters. We provide theoretical analysis demonstrating BNPO's variance-reducing properties and show that it generalizes both REINFORCE and GRPO under binary-valued reward settings. Experimental results confirm that BNPO achieves state-of-the-art performance among policy optimization methods on reasoning tasks.
arXiv Detail & Related papers (2025-06-03T13:28:57Z)
On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z)
Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2× and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization [11.642505299142956]
Continuous control of non-stationary environments is a major challenge for deep reinforcement learning algorithms. We show that performing on-policy reinforcement learning with an evidential critic provides both of these properties. We name the resulting algorithm Evidential Proximal Policy Optimization (EPPO) due to the integral role of evidential uncertainty in both policy evaluation and policy improvement stages.
arXiv Detail & Related papers (2025-03-03T12:23:07Z)
Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization [10.465789490644031]
We propose a novel framework for robust regularized Markov decision process ($d$-RRMDP). For the offline RL setting, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI).
arXiv Detail & Related papers (2024-11-27T18:57:03Z)
Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled. Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step. Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from replay buffer and update policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.