GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
- URL: http://arxiv.org/abs/2508.04349v1
- Date: Wed, 06 Aug 2025 11:42:47 GMT
- Title: GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
- Authors: Hongze Tan, Jianfei Pan
- Abstract summary: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but relies on coarse-grained credit assignment. This paper addresses that limitation with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper addresses the problem with \textbf{Dynamic Entropy Weighting}. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create finer-grained reward signals for precise policy updates in two ways: 1) \textbf{Group Token Policy Optimization} (\textbf{GTPO}) assigns an entropy-weighted reward to each token for fine-grained credit assignment; 2) \textbf{Sequence-Level Group Relative Policy Optimization} (\textbf{GRPO-S}) assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. The results confirm that the entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
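To make the mechanism concrete, below is a minimal sketch of one plausible form of Dynamic Entropy Weighting, written against the abstract only: a GTPO-style token-level reward that redistributes each response's outcome reward in proportion to token entropy, and a GRPO-S-style sequence reward scaled by mean token entropy before group normalization. The function name, the exact normalization, and the scaling choices are illustrative assumptions, not the paper's formulation.
```python
import torch

def entropy_weighted_rewards(token_entropies, outcome_rewards, mask, eps=1e-6):
    """Sketch of Dynamic Entropy Weighting for a group of G sampled responses.

    token_entropies: (G, T) per-token policy entropies, zero-padded to length T
    outcome_rewards: (G,)   scalar outcome reward per response (e.g. 1.0 if correct)
    mask:            (G, T) 1.0 for real tokens, 0.0 for padding
    Returns (G, T) token-level rewards (GTPO-style) and (G,) sequence-level
    advantages (GRPO-S-style).
    """
    lengths = mask.sum(dim=1).clamp(min=1.0)

    # GTPO-style (assumed form): redistribute each response's outcome reward over
    # its tokens in proportion to their entropy, scaled so the per-token mean still
    # equals the original reward; high-entropy tokens in correct responses get more credit.
    ent_weights = token_entropies * mask
    ent_weights = ent_weights / (ent_weights.sum(dim=1, keepdim=True) + eps)
    token_rewards = outcome_rewards.unsqueeze(1) * ent_weights * lengths.unsqueeze(1)

    # GRPO-S-style (assumed form): scale each sequence's reward by its mean token
    # entropy, then compute group-relative advantages as in GRPO.
    mean_entropy = (token_entropies * mask).sum(dim=1) / lengths
    seq_rewards = outcome_rewards * mean_entropy
    seq_advantages = (seq_rewards - seq_rewards.mean()) / (seq_rewards.std() + eps)
    return token_rewards, seq_advantages
```
For a group of, say, G = 4 responses to the same prompt with outcome_rewards = torch.tensor([1., 0., 1., 0.]), the returned token-level rewards (or sequence-level advantages) would stand in for the uniform sequence reward in a GRPO/PPO-style policy-gradient update.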
Related papers
- Group Sequence Policy Optimization [55.40088895148603]
Group Sequence Policy Optimization (GSPO) is a stable, efficient, and performant reinforcement learning algorithm. GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.
arXiv Detail & Related papers (2025-07-24T03:50:32Z) - TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization [73.16975077770765]
Recent advancements in reinforcement learning have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO). However, it is challenging to leverage such token-level rewards as guidance for Direct Preference Optimization (DPO). This work decomposes PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance.
arXiv Detail & Related papers (2025-06-17T14:30:06Z) - Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
We propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density. We also reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z) - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models [99.98293908799731]
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. In practice, we establish a transformation equation $R = -a e^{H} + b$ between entropy $H$ and downstream performance $R$. We propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply a KL penalty to tokens with high covariance, respectively.
arXiv Detail & Related papers (2025-05-28T17:38:45Z) - Learning Explainable Dense Reward Shapes via Bayesian Optimization [45.34810347865996]
We frame reward shaping as an optimization problem focused on token-level credit assignment. We use explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines.
arXiv Detail & Related papers (2025-04-22T21:09:33Z) - Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification [19.315342870604113]
Group Relative Policy Optimization (GRPO) was introduced recently and used successfully to train DeepSeek-R1 models. We show in this paper that GRPO with verifiable rewards can be written as a Kullback-Leibler regularized contrastive loss. We show that the fixed point $p^*$ is guaranteed to be larger than $p_{\text{ref}}$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
arXiv Detail & Related papers (2025-03-09T14:36:45Z) - Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation [0.276240219662896]
A notable form of entropy regularisation is augmenting the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy.
This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes.
This paper proposes a simple method of separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings.
arXiv Detail & Related papers (2024-07-25T15:48:24Z) - DPO Meets PPO: Reinforced Token Optimization for RLHF [35.638723885233475]
We introduce an algorithm that learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Experiments demonstrate that RTO performs better than PPO and other direct preference learning algorithms.
arXiv Detail & Related papers (2024-04-29T17:58:30Z) - From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
arXiv Detail & Related papers (2024-04-18T17:37:02Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Do You Need the Entropy Reward (in Practice)? [29.811723497181486]
It is believed that the regularization imposed by entropy on both policy improvement and policy evaluation contributes to good exploration, training convergence, and robustness of learned policies.
This paper takes a closer look at entropy as an intrinsic reward, by conducting various ablation studies on soft actor-critic (SAC).
Our findings reveal that in general, entropy rewards should be applied with caution to policy evaluation.
arXiv Detail & Related papers (2022-01-28T21:43:21Z)