Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2510.03669v2
- Date: Sat, 11 Oct 2025 22:16:29 GMT
- Title: Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
- Authors: Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
- Abstract summary: We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses. We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. This insight suggests a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration.
- Score: 64.04741347596938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR value and weakening negative ones, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
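The abstract specifies the control rule at the level of signs: amplify positive-THR tokens to favor exploitation, or reverse the weighting to favor exploration. Below is a minimal sketch of how such a rule could sit on top of a GRPO-style token-level loss. The THR values are assumed precomputed (the paper defines their exact form); the function name, the tanh squashing, and the beta coefficient are illustrative assumptions, not the authors' implementation.

```python
import torch

def thr_reweighted_loss(logp, adv, thr, beta=0.5, mode="exploit"):
    """logp: (B, T) token log-probs from the current policy.
    adv: (B, T) GRPO group-relative advantages.
    thr: (B, T) precomputed Token Hidden Reward values (assumed given).
    mode: "exploit" amplifies positive-THR tokens and damps negative ones;
    "explore" reverses the weighting, as the abstract describes."""
    sign = 1.0 if mode == "exploit" else -1.0
    # Weights stay in [1 - beta, 1 + beta], centered at 1 so tokens with
    # THR near 0 keep the vanilla GRPO learning signal.
    w = 1.0 + sign * beta * torch.tanh(thr)
    # Token-level policy-gradient surrogate with detached weights.
    return -(w.detach() * adv * logp).mean()

# Toy usage with random tensors standing in for a rollout batch.
B, T = 4, 16
logp = torch.randn(B, T, requires_grad=True)
loss = thr_reweighted_loss(logp, torch.randn(B, T), torch.randn(B, T))
loss.backward()
```

Centering the weights at 1 keeps the intervention a perturbation of GRPO rather than a replacement, which matches the abstract's claim that the method integrates with other objectives such as GSPO.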
Related papers
- Reinforcement Learning with Promising Tokens for Large Language Models [11.420715885411925]
Reinforcement learning (RL) has emerged as a key paradigm for aligning and optimizing large language models (LLMs). We introduce Reinforcement Learning with Promising Tokens (R), a framework that mitigates the action space issue by decoupling strategic decision-making from token generation.
arXiv Detail & Related papers (2026-02-03T07:08:06Z)
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
- Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning [60.00161035836637]
Group Relative Policy Optimization has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. We introduce Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline.
arXiv Detail & Related papers (2026-01-12T10:48:02Z)
- Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models [18.785063555637613]
Group Relative Policy Optimization (GRPO) has demonstrated strong performance, but it suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates. This imbalance leads to unstable training and suppresses the contribution of high-probability tokens.
arXiv Detail & Related papers (2025-10-29T08:07:47Z)
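A minimal sketch of the regulation idea summarized above, assuming the fix is to temper each token's gradient by its own detached probability (the gradient of a token's log-probability is largest exactly when that probability is small). The paper's actual regulator may differ.

```python
import torch

def token_regulated_pg_loss(logp, adv):
    # Scaling by the detached probability tempers rare tokens'
    # outsized gradients without changing where the gradient points.
    p = logp.detach().exp()
    return -(p * adv * logp).mean()
```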
- ASPO: Asymmetric Importance Sampling Policy Optimization [31.38346888572171]
The Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. We propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens.
arXiv Detail & Related papers (2025-10-07T15:54:24Z)
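A hedged sketch of the ASPO idea as summarized above. Reading "flips" as taking the reciprocal of the IS ratio is an assumption made here for illustration; consult the paper for the exact rule.

```python
import torch

def aspo_token_weights(logp_new, logp_old, adv):
    ratio = torch.exp(logp_new - logp_old)   # standard IS ratio
    flipped = 1.0 / ratio.clamp(min=1e-6)    # reciprocal, clamped for stability
    # Flip only where the advantage is positive, as the summary describes.
    return torch.where(adv > 0, flipped, ratio)
```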
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
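The three stabilizers named in that summary compose naturally in one loss; the sketch below shows the standard form of each. The coefficients and reset period are illustrative assumptions, not the paper's settings.

```python
import torch

def stabilized_pg_loss(logp_new, logp_old, logp_ref, adv, eps=0.2, kl_coef=0.01):
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    pg = -torch.min(ratio * adv, clipped * adv).mean()  # PPO-style clipping
    kl = (logp_new - logp_ref).mean()  # sample-based KL(policy || reference)
    return pg + kl_coef * kl

# Periodic reference reset (every `reset_every` optimizer steps):
#   if step % reset_every == 0:
#       ref_model.load_state_dict(policy_model.state_dict())
```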
- Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards [17.695285420477035]
We study the intermediate range of algorithms between off-policy RL and supervised fine-tuning. We first provide a theoretical analysis of this off-policy REINFORCE algorithm. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones.
arXiv Detail & Related papers (2025-06-25T15:07:16Z)
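One simple instantiation of that finding is an asymmetric REINFORCE loss that keeps full weight on positive rewards and down-weights negative ones; the `neg_weight` coefficient below is a stand-in, not a value from the paper.

```python
import torch

def asymmetric_reinforce_loss(seq_logp, reward, neg_weight=0.3):
    # Full weight on positive rewards, reduced weight on negative ones.
    w = torch.where(reward > 0,
                    torch.ones_like(reward),
                    torch.full_like(reward, neg_weight))
    return -(w * reward * seq_logp).mean()
```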
- Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective [6.069069082518759]
We study the Zero-Reward Assumption in reinforcement learning for large language models (LLMs). We show that the policy gradient based on true, unknown token-level rewards can be unbiasedly estimated using only a response-level reward model. We propose a new algorithm: Token-Reinforced Policy Optimization (TRePO).
arXiv Detail & Related papers (2025-06-03T07:44:31Z)
- Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z)
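A sketch of an intrinsic confidence reward in the spirit of that summary. Proxying self-certainty as the mean KL divergence from the uniform distribution (log V minus entropy) is an assumption here; the paper's exact definition may differ.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits):
    """logits: (T, V) next-token logits over a generated response."""
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)  # (T,)
    # KL(p || uniform) = log V - H(p); higher means more confident.
    return (math.log(logits.size(-1)) - entropy).mean()
```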
- On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization [52.76330545825083]
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs). We identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. We develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens.
arXiv Detail & Related papers (2025-05-24T18:58:51Z)
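A sketch of the penalty-downweighting step described above. The per-token `influence` score and the 90th-percentile cutoff are stand-ins for the paper's anchor-based construction of influential tokens.

```python
import torch

def penalty_downweighted_loss(logp, adv, influence, keep=0.1):
    # Down-weight penalties (negative advantages) on the most influential
    # tokens; everything else keeps the vanilla GRPO signal.
    mask = (adv < 0) & (influence > influence.quantile(0.9))
    w = torch.where(mask, torch.full_like(adv, keep), torch.ones_like(adv))
    return -(w * adv * logp).mean()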
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
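One way rollout sampling can localize critical tokens is to measure how the success rate of sampled continuations shifts as the trajectory commits to each successive token. The sketch below follows that idea; `sample_continuations` and `is_correct` are model- and task-specific placeholders, and the scoring rule is an illustration rather than the paper's estimator.

```python
def critical_token_scores(tokens, sample_continuations, is_correct, n=8):
    """tokens: a generated reasoning trajectory (list of token ids).
    sample_continuations(prefix, n): n sampled completions of `prefix`.
    is_correct(completion): whether the completion reaches a correct answer."""
    # Success rate of rollouts conditioned on each prefix length.
    acc = []
    for i in range(len(tokens) + 1):
        rollouts = sample_continuations(tokens[:i], n)
        acc.append(sum(map(is_correct, rollouts)) / n)
    # Token i is critical if committing to it sharply shifts the success rate.
    return [acc[i + 1] - acc[i] for i in range(len(tokens))]
```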
- Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework [1.5802986215292307]
Language Model Guided reward Tuning (LMGT) is a novel, sample-efficient framework for Reinforcement Learning. We show that LMGT adeptly balances exploration and exploitation, thereby guiding the agent's exploratory behavior. Our results suggest that LMGT can substantially reduce the computational resources required during the RL training phase.
arXiv Detail & Related papers (2024-09-07T07:40:43Z)