From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature
- URL: http://arxiv.org/abs/2509.16591v1
- Date: Sat, 20 Sep 2025 09:30:25 GMT
- Title: From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature
- Authors: Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang,
- Abstract summary: Existing algorithms apply uniform optimization to all tokens, ignoring their different roles in reasoning process.<n>We introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that dynamically adapts optimization based on token entropy.
- Score: 38.46122853450324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning has emerged as the fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average that normalizes advantages at token level, jointly accounting for sequence-length as in token-mean loss while preserving non-biased treatment. We then develop Differential Advantage Redistribution that leverages entropy and importance ratios to modulate rewards-adjusting updates for tokens with clear signals. For clipping loss, we design Asymmetric Adaptive Clipping, allowing aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through systematic investigation between entropy and training dynamics, we embedded token-level treatment into every stages to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code can be found in https://github.com/starriver030515/HAPO.
Related papers
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens [38.425692691443764]
ExistingReinforcement Learning (RL) fine-tuning methods rely heavily on entropy regularization and reweighting to maintain stability.<n>In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training.<n>We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens.<n>We propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement.
arXiv Detail & Related papers (2026-02-17T14:46:48Z) - Entropy-Gated Selective Policy Optimization:Token-Level Gradient Allocation for Hybrid Training of Large Language Models [18.084251607403406]
Hybrid training methods for large language models combine supervised fine tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level.<n>We propose Entropy Gated Selective Policy Optimization (EGSPO), a three stage framework that extends sample level mixing with token level gradient modulation.<n> EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead.
arXiv Detail & Related papers (2026-02-03T09:38:21Z) - Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning [55.2818264614932]
RankTuner introduces a probability--entropy calibration signal, the Relative Rank Indicator, which compares the rank of the ground-truth token with its expected rank under the prediction distribution.<n>The inverse indicator is used as a token-wise Relative Scale to reweight the fine-tuning objective, focusing updates on truly under-learned tokens.
arXiv Detail & Related papers (2026-02-02T07:27:19Z) - Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning [30.889495810312624]
We propose an efficient reinforcement learning framework that leverages entropy signals at both the semantic and token levels to improve reasoning.<n>By jointly optimizing data organization and algorithmic design, our method effectively mitigates entropy collapse and enhances reasoning.
arXiv Detail & Related papers (2025-12-04T01:09:17Z) - Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models.<n>But adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges.<n>We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
arXiv Detail & Related papers (2025-12-03T13:05:32Z) - GateRA: Token-Aware Modulation for Parameter-Efficient Fine-Tuning [51.79350934271497]
GateRA is a unified framework that introduces token-aware modulation to dynamically adjust the strength of PEFT updates.<n>By incorporating adaptive gating into standard PEFT branches, GateRA enables selective, token-level adaptation.<n> Experiments on multiple commonsense reasoning benchmarks demonstrate that GateRA consistently outperforms or matches prior PEFT methods.
arXiv Detail & Related papers (2025-11-15T17:55:47Z) - ASPO: Asymmetric Importance Sampling Policy Optimization [31.38346888572171]
The Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens.<n>This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones.<n>We propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens.
arXiv Detail & Related papers (2025-10-07T15:54:24Z) - CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning [28.02073546326571]
Policy entropy reflects the balance between exploration and exploitation during training.<n>Existing methods discard valuable gradient signals from low-probability tokens due to the clipping mechanism.<n>We propose textbfCoordinating textbfEntropy via textbfGradient textbfPreserving textbfPolicy textbfOptimization.
arXiv Detail & Related papers (2025-09-25T03:22:04Z) - GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy [0.0]
Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning.<n>This paper solves this with textbfDynamic Entropy Weighting.<n>Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling.
arXiv Detail & Related papers (2025-08-06T11:42:47Z) - Learning Explainable Dense Reward Shapes via Bayesian Optimization [45.34810347865996]
We frame reward shaping as an optimization problem focused on token-level credit assignment.<n>We use explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model.<n>Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines.
arXiv Detail & Related papers (2025-04-22T21:09:33Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.