GTPO: Trajectory-Based Policy Optimization in Large Language Models
- URL: http://arxiv.org/abs/2508.03772v1
- Date: Tue, 05 Aug 2025 08:15:01 GMT
- Title: GTPO: Trajectory-Based Policy Optimization in Large Language Models
- Authors: Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino,
- Abstract summary: Policy-based optimizations are widely adopted today for the training and alignment of language models. In this paper, we reveal and analyze two major limitations of GRPO. We introduce GTPO, which identifies conflict tokens: tokens appearing in the same position across completions with opposite rewards.
- Score: 45.799380822683034
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Policy-based optimizations are widely adopted today for the training and alignment of language models, where one of the most recent and effective approaches is Group-relative Policy Optimization (GRPO). In this paper, we reveal and analyze two major limitations of GRPO: (i) tokens frequently appear in completions with both positive and negative rewards, leading to conflicting gradient updates that can reduce their output probability, even though they can be essential for maintaining proper structure; (ii) negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, progressively flattening the output distribution and degrading learning. To address these issues and provide a more stable and effective policy optimization strategy, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which identifies conflict tokens (tokens appearing in the same position across completions with opposite rewards) and protects them by skipping negative updates while amplifying positive ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, validated through multiple experiments on the GSM8K, MATH, and AIME 2024 benchmarks.
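Because the abstract describes the GTPO recipe only in prose (group-relative advantages, conflict-token protection, and an entropy-based completion filter), a minimal sketch of that token-level weighting is given below. It is an illustration under assumptions, not the paper's implementation: the function name gtpo_token_weights, the amplification factor alpha, and the fixed threshold tau are invented here, whereas the paper derives its entropy threshold and integrates the weights into the policy-gradient loss.

```python
# Hedged sketch of the token-level weighting described in the GTPO abstract;
# not the authors' code. `alpha` and `tau` are illustrative assumptions.
import numpy as np

def gtpo_token_weights(completions, rewards, entropies, alpha=2.0, tau=2.5):
    """Per-token update weights for one group of completions of the same prompt.

    completions: list of token-id sequences sampled for the prompt
    rewards:     one scalar reward per completion
    entropies:   mean token entropy per completion (used by the collapse filter)
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()                    # group-relative advantage (GRPO-style)
    keep = np.asarray(entropies, dtype=float) <= tau  # drop high-entropy completions

    # Conflict tokens: the same (position, token) pair occurs both in a
    # positively and in a negatively rewarded completion of the group.
    pos, neg = set(), set()
    for comp, a, k in zip(completions, adv, keep):
        if not k:
            continue
        bucket = pos if a > 0 else neg
        bucket.update((t, tok) for t, tok in enumerate(comp))
    conflicts = pos & neg

    # Protect conflict tokens: skip their negative updates and amplify positive
    # ones; every other token keeps the plain group-relative advantage.
    weights = []
    for comp, a, k in zip(completions, adv, keep):
        if not k:                                     # filtered completion: no update
            weights.append(np.zeros(len(comp)))
            continue
        w = np.full(len(comp), a)
        for t, tok in enumerate(comp):
            if (t, tok) in conflicts:
                w[t] = alpha * a if a > 0 else 0.0
        weights.append(w)
    return weights

# Toy usage: three completions, the third is filtered out by the entropy check.
if __name__ == "__main__":
    group = [[5, 7, 9], [5, 8, 9], [5, 7, 2]]
    print(gtpo_token_weights(group, rewards=[1.0, 0.0, -1.0], entropies=[1.1, 0.9, 3.0]))
```

In a full trainer these per-token weights would multiply the token log-probability terms of the loss, in place of the single sequence-level advantage that GRPO applies uniformly to a completion.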
Related papers
- On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence [2.8165669455824696]
Group Relative Policy Optimization is a critic-free reinforcement learning algorithm. We show that the GRPO update rule estimates the policy gradient at the old policy rather than the current one. We propose a new algorithm, Trajectory-level Importance-Corrected GRPO (a hedged sketch of such a trajectory-level correction appears after this list).
arXiv Detail & Related papers (2025-08-04T19:01:19Z) - Truncated Proximal Policy Optimization [43.965892659920364]
Truncated Proximal Policy Optimization (T-PPO) improves training efficiency by streamlining policy updates and length-restricted response generation. We propose Extended Generalized Advantage Estimation (EGAE) for advantage estimation derived from incomplete responses. We demonstrate the effectiveness of T-PPO on AIME 2024 with a 32B base model.
arXiv Detail & Related papers (2025-06-18T01:21:38Z) - On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z) - CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models [68.26281707780761]
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models. We show that CPPO achieves up to an $8.32\times$ speedup on GSM8K and $3.51\times$ on MATH while preserving or even enhancing accuracy compared to the original GRPO.
arXiv Detail & Related papers (2025-03-28T11:30:05Z) - Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach [2.8626097661711394]
Reinforcement Learning from Human Feedback has achieved notable success in steering models, but is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade off certain objectives. We propose a Group Relative Policy Optimization framework with a multi-label reward regression model to achieve safe and aligned language generation.
arXiv Detail & Related papers (2025-03-26T05:50:33Z) - Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards [8.455772877963792]
We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL). We demonstrate the effectiveness of our approach through extensive experiments on inverted pendulum and lunar lander environments.
arXiv Detail & Related papers (2024-11-26T20:22:31Z) - Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z) - You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that the proposed alternative, Early Stopping Policy Optimization (ESPO), can be easily scaled up to distributed training with many workers, delivering strong performance as well.
arXiv Detail & Related papers (2022-01-31T20:26:56Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Iterative Amortized Policy Optimization [147.63129234446197]
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control.
From the variational inference perspective, policy networks are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly.
We demonstrate that iterative amortized policy optimization yields performance improvements over direct amortization on benchmark continuous control tasks.
arXiv Detail & Related papers (2020-10-20T23:25:42Z)
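As noted in the TIC-GRPO entry above, the trajectory-level importance correction it summarizes can be sketched as a surrogate loss whose gradient is reweighted by one ratio per completion. This is a hedged, generic illustration rather than that paper's algorithm: the tensor shapes, the log-ratio cap, and the assumption that padding has already been masked out of the log-probabilities are choices made for the example.

```python
# Generic sketch of a trajectory-level importance correction for a GRPO-style
# update; names, shapes, and the cap are assumptions, not the paper's algorithm.
import torch

def trajectory_corrected_surrogate(logp_current, logp_old, rewards, log_ratio_cap=5.0):
    """Surrogate loss whose gradient carries one importance ratio per trajectory.

    logp_current: (G, T) token log-probs under the policy being updated (requires grad)
    logp_old:     (G, T) token log-probs under the policy that sampled the completions
    rewards:      (G,)   scalar reward per completion; padding assumed already masked out
    """
    adv = rewards - rewards.mean()                     # group-relative advantage
    log_ratio = (logp_current - logp_old).sum(dim=-1)  # one log-ratio per trajectory
    ratio = log_ratio.clamp(max=log_ratio_cap).exp()   # cap for numerical safety only
    # Differentiating ratio * adv yields ratio * grad(log pi) * adv, i.e. a policy
    # gradient re-centred on the current policy rather than the old one.
    return -(ratio * adv).mean()

# Toy usage with random log-probabilities for a group of 4 completions of length 6.
if __name__ == "__main__":
    G, T = 4, 6
    logp_old = torch.log_softmax(torch.randn(G, T, 10), dim=-1).max(dim=-1).values
    logp_cur = (logp_old + 0.05 * torch.randn(G, T)).requires_grad_()
    loss = trajectory_corrected_surrogate(logp_cur, logp_old.detach(),
                                          torch.tensor([1.0, 0.0, 0.0, -1.0]))
    loss.backward()
    print(loss.item(), logp_cur.grad.shape)
```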