GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
- URL: http://arxiv.org/abs/2508.04349v4
- Date: Mon, 18 Aug 2025 14:20:57 GMT
- Title: GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
- Authors: Hongze Tan, Jianfei Pan,
- Abstract summary: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning.<n>This paper solves this with textbfDynamic Entropy Weighting.<n>Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this with \textbf{Dynamic Entropy Weighting}. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates via two ways: 1) \textbf{Group Token Policy Optimization} (\textbf{GTPO}), we assigns a entropy-weighted reward to each token for fine-grained credit assignment. 2) \textbf{Sequence-Level Group Relative Policy Optimization} (\textbf{GRPO-S}), we assigns a entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
Related papers
- Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs)<n>We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths.<n>We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses.
arXiv Detail & Related papers (2026-02-05T04:06:55Z) - Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization [52.74762030521324]
We propose a novel algorithm to learn reward functions from observed actions.<n>We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm.
arXiv Detail & Related papers (2026-01-19T04:12:51Z) - Owen-Shapley Policy Optimization (OSPO): A Principled RL Algorithm for Generative Search LLMs [12.75200353208858]
Owen-Shapley Policy Optimization (OSPO) is a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes.<n>Unlike value-model-based methods requiring additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit.<n> Experiments on Amazon ESCI and H&M Fashion datasets show consistent gains over baselines.
arXiv Detail & Related papers (2026-01-13T10:17:46Z) - Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR [31.43482175098666]
Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks.<n>Existing RLVR algorithms focus on different granularities, and each has complementary strengths and limitations.<n>We propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective.
arXiv Detail & Related papers (2026-01-09T07:57:40Z) - Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning [30.889495810312624]
We propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning.<n>By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances reasoning.
arXiv Detail & Related papers (2025-12-04T01:09:17Z) - Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models.<n>But adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges.<n>We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
arXiv Detail & Related papers (2025-12-03T13:05:32Z) - Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs)<n>We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference.<n>We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z) - Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment.<n>Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z) - CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning [28.02073546326571]
Policy entropy reflects the balance between exploration and exploitation during training.<n>Existing methods discard valuable gradient signals from low-probability tokens due to the clipping mechanism.<n>We propose textbfCoordinating textbfEntropy via textbfGradient textbfPreserving textbfPolicy textbfOptimization.
arXiv Detail & Related papers (2025-09-25T03:22:04Z) - Group Sequence Policy Optimization [55.40088895148603]
Group Sequence Policy Optimization (GSPO) is a stable, efficient, and performant reinforcement learning algorithm.<n>GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.
arXiv Detail & Related papers (2025-07-24T03:50:32Z) - TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization [73.16975077770765]
Recent advancements in reinforcement learning have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO)<n>It is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO)<n>This work decomposes PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance.
arXiv Detail & Related papers (2025-06-17T14:30:06Z) - Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity.<n>Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density.<n>We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z) - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models [99.98293908799731]
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy.<n>In practice, we establish a transformation equation R=-a*eH+b between entropy H and downstream performance R.<n>We propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply KL penalty to tokens with high covariances respectively.
arXiv Detail & Related papers (2025-05-28T17:38:45Z) - Learning Explainable Dense Reward Shapes via Bayesian Optimization [45.34810347865996]
We frame reward shaping as an optimization problem focused on token-level credit assignment.<n>We use explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model.<n>Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines.
arXiv Detail & Related papers (2025-04-22T21:09:33Z) - Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification [19.315342870604113]
Group Relative Policy Optimization was introduced recently and used successfully to train DeepSeek-R1 models.<n>We show in this paper that GRPO with verifiable rewards can be written as a Kullback--Leibler regularized contrastive loss.<n>We show that the fixed point $p*$ is guaranteed to be larger than $p_textref$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
arXiv Detail & Related papers (2025-03-09T14:36:45Z) - Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation [0.276240219662896]
A notable form of entropy regularisation is augmenting the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy.
This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes.
This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings.
arXiv Detail & Related papers (2024-07-25T15:48:24Z) - DPO Meets PPO: Reinforced Token Optimization for RLHF [35.638723885233475]
We introduce an algorithm that learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal.<n>Experiments demonstrate that textttRTO performs better than PPO and other direct preference learning algorithms.
arXiv Detail & Related papers (2024-04-29T17:58:30Z) - From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
arXiv Detail & Related papers (2024-04-18T17:37:02Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Do You Need the Entropy Reward (in Practice)? [29.811723497181486]
It is believed that the regularization imposed by entropy, on both policy improvement and policy evaluation, together contributes to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward, by conducting various ablation studies on soft actor-critic (SAC)
Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation.
arXiv Detail & Related papers (2022-01-28T21:43:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.