CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
- URL: http://arxiv.org/abs/2509.20712v4
- Date: Wed, 15 Oct 2025 13:41:22 GMT
- Title: CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
- Authors: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
- Abstract summary: Policy entropy reflects the balance between exploration and exploitation during training. Existing methods discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO).
- Score: 28.02073546326571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
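The mechanism described in the abstract, re-attaching a bounded gradient to tokens that the PPO clip would silence, can be illustrated with a minimal PyTorch sketch. This is an illustration under our own assumptions about how the bound is applied, not the authors' reference implementation; the names `ce_gppo_loss`, `eps`, `beta_low`, and `beta_high` are invented for the example.

```python
import torch

def ce_gppo_loss(logprobs, old_logprobs, advantages,
                 eps=0.2, beta_low=0.1, beta_high=0.1):
    """Clipped PPO surrogate whose forward value is unchanged, but whose
    backward pass re-introduces a small, bounded gradient for tokens that
    the clip would otherwise silence (a sketch of the CE-GPPO idea)."""
    ratio = torch.exp(logprobs - old_logprobs)           # importance ratio r_t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # standard PPO clip

    # Standard clipped surrogate: gradient vanishes wherever the clipped
    # branch is selected by the min.
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)

    # Tokens silenced by the clip: positive-advantage tokens pushed above the
    # interval and negative-advantage tokens pushed below it.
    below = (ratio < 1.0 - eps) & (advantages < 0)
    above = (ratio > 1.0 + eps) & (advantages > 0)

    # ratio / ratio.detach() has value 1 but gradient equal to grad(log pi),
    # so the re-attached gradient magnitude stays bounded regardless of how
    # extreme the ratio is.
    bounded_grad = (ratio / ratio.detach()) * advantages
    reattach = torch.where(below, beta_low * bounded_grad, torch.zeros_like(bounded_grad))
    reattach = reattach + torch.where(above, beta_high * bounded_grad, torch.zeros_like(bounded_grad))

    # Adding (x - x.detach()) leaves the loss value unchanged while restoring
    # a gradient signal from the clipped tokens.
    surrogate = surrogate + (reattach - reattach.detach())
    return -surrogate.mean()
```

The coefficients `beta_low` and `beta_high` play the role of the bound on the re-introduced gradient; setting both to zero recovers ordinary clipped PPO.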
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning [18.440289150575648]
We propose a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. We evaluate various entropy regularization methods and demonstrate that lower entropy of the output prediction distribution enhances adaptation in pretrained vision models.
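The summary does not spell out the annealing schedule; the sketch below shows one common way such a transition can be realized, a policy-gradient loss whose entropy coefficient decays over training. The linear decay and the parameter names `coef_start` and `coef_end` are assumptions for illustration.

```python
import torch

def annealed_entropy_coef(step, total_steps, coef_start=0.01, coef_end=0.0):
    """Linearly decay the entropy coefficient from an exploratory regime
    toward an exploitative one (schedule shape assumed for illustration)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return coef_start + frac * (coef_end - coef_start)

def policy_loss_with_annealed_entropy(logits, actions, advantages, step, total_steps):
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Policy-gradient term for the sampled actions.
    action_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * action_logp).mean()
    # Entropy bonus, annealed toward zero as training progresses.
    entropy = -(probs * log_probs).sum(-1).mean()
    coef = annealed_entropy_coef(step, total_steps)
    return pg_loss - coef * entropy
```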
arXiv Detail & Related papers (2026-02-15T10:05:03Z) - Difficulty-Estimated Policy Optimization [38.86673795561421]
We propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling.
arXiv Detail & Related papers (2026-02-06T04:12:23Z) - Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths. We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses.
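One way to read the ARM idea is to scale each correct response's advantage so that low-likelihood correct paths are not drowned out by the highest-likelihood ones. The softmax-over-negative-log-probability weighting and the function name `reweighted_advantages` below are our own assumptions, not the paper's formula.

```python
import torch

def reweighted_advantages(advantages, seq_logprobs, is_correct, temperature=1.0):
    """Illustrative advantage re-weighting within a sampled group:
    correct responses the policy already assigns high probability are
    down-weighted, low-probability correct responses are up-weighted."""
    weights = torch.ones_like(advantages)
    if is_correct.any():
        # Normalize over the correct responses so weights average to 1.
        w = torch.softmax(-seq_logprobs[is_correct] / temperature, dim=0)
        weights[is_correct] = w * is_correct.sum()
    return advantages * weights
```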
arXiv Detail & Related papers (2026-02-05T04:06:55Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
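The summary says clipping is replaced by a divergence-based constraint but does not state which divergence; a common stand-in is a per-token KL penalty toward the behavior policy, sketched below under that assumption (coefficient `kl_coef` is invented for the example).

```python
import torch

def divergence_penalized_loss(logprobs, old_logprobs, advantages, kl_coef=0.05):
    """Unclipped importance-weighted surrogate regularized by an estimated
    KL divergence toward the old policy, instead of hard ratio clipping."""
    ratio = torch.exp(logprobs - old_logprobs)
    surrogate = ratio * advantages
    # Non-negative per-token KL(old || new) estimate on sampled tokens
    # (the "k3" estimator: r - 1 - log r).
    kl = ratio - 1.0 - (logprobs - old_logprobs)
    return -(surrogate - kl_coef * kl).mean()
```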
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models [54.61810451777578]
Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models. Recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploitation in reinforcement fine-tuning.
arXiv Detail & Related papers (2026-02-03T11:14:58Z) - Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning [49.92803982100042]
We propose using the entropy ratio between the current and previous policies as a new global metric. We introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of un-sampled actions.
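A minimal sketch of the idea: compute the ratio of current to previous mean policy entropy and penalize the loss whenever the ratio leaves a symmetric band. The quadratic penalty form and the hyperparameters `eps` and `coef` are assumptions for illustration, not the ERC paper's exact construction.

```python
import torch

def entropy_ratio_penalty(logits_new, logits_old, eps=0.1, coef=0.01):
    """Soft global constraint on the current/previous entropy ratio,
    penalizing both entropy collapse and entropy explosion."""
    def mean_entropy(logits):
        logp = torch.log_softmax(logits, dim=-1)
        return -(logp.exp() * logp).sum(-1).mean()

    h_new = mean_entropy(logits_new)
    h_old = mean_entropy(logits_old).detach()   # previous policy: no gradient
    ratio = h_new / (h_old + 1e-8)

    # Bidirectional constraint: penalize ratio < 1 - eps and ratio > 1 + eps.
    excess = torch.clamp(ratio - (1.0 + eps), min=0.0) + torch.clamp((1.0 - eps) - ratio, min=0.0)
    return coef * excess ** 2
```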
arXiv Detail & Related papers (2025-12-05T10:26:32Z) - GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy [5.691990020513277]
We propose Dynamic Entropy Weighting, a novel mechanism that facilitates fine-grained rewards through two new algorithms. By repurposing policy entropy for reward shaping, we achieve true per-token credit assignment.
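One simple reading of entropy-based reward shaping is to distribute a sequence-level reward across tokens in proportion to each token's policy entropy. The weighting below is an assumed stand-in; the exact shaping functions used by GTPO and GRPO-S may differ.

```python
import torch

def entropy_shaped_token_rewards(seq_reward, token_entropies, alpha=1.0):
    """Spread a sequence-level reward over tokens, weighted by per-token
    policy entropy, so that per-token rewards sum to the sequence reward."""
    weights = token_entropies.clamp_min(1e-8) ** alpha
    weights = weights / weights.sum()
    return seq_reward * weights
```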
arXiv Detail & Related papers (2025-08-06T11:42:47Z) - GTPO: Trajectory-Based Policy Optimization in Large Language Models [42.60363805227946]
Policy-based optimizations are widely adopted today for the training and alignment of language models. In this paper, we reveal and analyze two major limitations of GRPO. We introduce GTPO, which identifies conflict tokens: tokens that appear in the same position across completions with opposite rewards.
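The detection step described above can be sketched directly: flag token ids that occur at the same position in both positively and negatively rewarded completions of the same group. How GTPO then treats these tokens (masking, re-weighting) is not detailed in the summary; the helper name `conflict_token_mask` is invented here.

```python
import torch

def conflict_token_mask(token_ids, advantages):
    """Return a boolean mask over (num_completions, seq_len) marking
    'conflict tokens': identical token ids at the same position in
    completions whose advantages have opposite signs."""
    pos_ids = token_ids[advantages > 0]   # tokens from positively rewarded completions
    neg_ids = token_ids[advantages < 0]   # tokens from negatively rewarded completions
    mask = torch.zeros_like(token_ids, dtype=torch.bool)
    for t in range(token_ids.shape[1]):
        common = set(pos_ids[:, t].tolist()) & set(neg_ids[:, t].tolist())
        if common:
            conflict = torch.tensor(sorted(common), device=token_ids.device)
            mask[:, t] = torch.isin(token_ids[:, t], conflict)
    return mask
```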
arXiv Detail & Related papers (2025-08-05T08:15:01Z) - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models [99.98293908799731]
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. In practice, we establish a transformation equation R = -a * e^H + b between entropy H and downstream performance R. We propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which clip and apply a KL penalty to tokens with high covariances, respectively.
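The KL-Cov variant can be sketched as: rank tokens by a covariance proxy between log-probability and advantage, then apply a KL penalty only to the highest-covariance fraction. The statistics, the fraction `top_frac`, and the function name are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def kl_cov_loss(logprobs, old_logprobs, advantages, kl_coef=1.0, top_frac=0.002):
    """Vanilla importance-weighted surrogate, with a KL penalty added only on
    the tokens whose (log-prob, advantage) covariance proxy is largest."""
    ratio = torch.exp(logprobs - old_logprobs)
    pg = ratio * advantages

    # Per-token covariance proxy between log-probability and advantage.
    cov = (logprobs - logprobs.mean()) * (advantages - advantages.mean())

    # Select the highest-covariance tokens.
    k = max(1, int(top_frac * cov.numel()))
    thresh = torch.topk(cov.flatten(), k).values.min()
    high_cov = (cov >= thresh).float()

    # k3 KL estimate toward the old policy, applied only to selected tokens.
    kl = ratio - 1.0 - (logprobs - old_logprobs)
    return -(pg - kl_coef * high_cov * kl).mean()
```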
arXiv Detail & Related papers (2025-05-28T17:38:45Z) - On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization [52.76330545825083]
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs). We identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. We develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens.
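A highly simplified stand-in for the down-weighting step: tokens in negatively advantaged completions that also occur in the group's correct (positively advantaged) completions keep only a fraction of their penalty. The real NTHR criterion is influence-based rather than simple token membership; the names and the `downweight` factor below are invented for the example.

```python
import torch

def nthr_style_weights(token_ids, advantages, downweight=0.1):
    """Per-token penalty weights over (num_completions, seq_len): tokens in
    negative completions that are shared with correct anchor completions
    are down-weighted; everything else keeps weight 1."""
    weights = torch.ones_like(token_ids, dtype=torch.float)
    pos_tokens = torch.unique(token_ids[advantages > 0])   # anchor tokens from correct responses
    neg_rows = advantages < 0
    shared = torch.isin(token_ids[neg_rows], pos_tokens)
    weights[neg_rows] = torch.where(
        shared,
        torch.full_like(weights[neg_rows], downweight),
        weights[neg_rows],
    )
    return weights
```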
arXiv Detail & Related papers (2025-05-24T18:58:51Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
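In the spirit of entropy-augmented token-level optimization, a minimal sketch is a per-token policy-gradient loss with a per-token entropy bonus. ETPO's actual objective is derived from a soft-Bellman decomposition; the simplified form and parameter names below are assumptions for illustration only.

```python
import torch

def token_level_entropy_regularized_loss(logits, actions, token_advantages, entropy_coef=0.01):
    """Token-level policy-gradient loss with a per-token entropy bonus."""
    logp = torch.log_softmax(logits, dim=-1)                        # (batch, seq, vocab)
    token_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # log-prob of taken tokens
    pg = -(token_advantages * token_logp)                            # per-token policy-gradient term
    entropy = -(logp.exp() * logp).sum(-1)                           # per-token entropy
    return (pg - entropy_coef * entropy).mean()
```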
arXiv Detail & Related papers (2024-02-09T07:45:26Z)