Related papers: Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models

URL: http://arxiv.org/abs/2602.03309v1
Date: Tue, 03 Feb 2026 09:38:21 GMT
Title: Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models
Authors: Yuelin Hu, Zhengxue Cheng, Wei Liu, Li Song
Abstract summary: Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead.
Score: 18.084251607403406
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. Stage 1, SFT expert learning, establishes a reliable warm-up policy using expert demonstrations with a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes per-token predictive entropy. Stage 3, the EGSPO mechanism, applies entropy-gated gradient allocation: a predictive entropy module routes high-entropy tokens to full PPO updates to encourage exploration, and low-entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Critically, both branches incorporate the advantage function A_t, ensuring that incorrect trajectories receive consistent negative learning signals and preventing reinforcement of confident errors. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead.
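
The entropy-gated allocation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the hard entropy threshold, the attenuation weight `low_weight`, the clipping range `clip_eps`, and all function names are assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general idea that every token keeps its advantage-weighted PPO term, while low-entropy tokens are scaled down.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ppo_token_loss(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one token, written as a loss to minimize."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return -min(ratio * advantage, clipped * advantage)


def entropy_gated_loss(logits_per_token, ratios, advantages,
                       threshold=0.5, low_weight=0.1):
    """Mean per-token PPO loss with a hypothetical entropy gate.

    Tokens with entropy >= threshold get weight 1.0 (full update);
    the rest get low_weight (attenuated update). Both branches keep
    the advantage term, so negative advantages still penalize
    confident tokens in incorrect trajectories.
    threshold and low_weight are illustrative values, not from the paper.
    """
    if not logits_per_token:
        raise ValueError("need at least one token")
    weights = []
    total = 0.0
    for logits, ratio, adv in zip(logits_per_token, ratios, advantages):
        w = 1.0 if token_entropy(logits) >= threshold else low_weight
        weights.append(w)
        total += w * ppo_token_loss(ratio, adv)
    return total / len(weights), weights


if __name__ == "__main__":
    # One uncertain token (uniform over two options) and one confident token,
    # both from a trajectory with negative advantage.
    loss, weights = entropy_gated_loss(
        [[0.0, 0.0], [10.0, 0.0]], ratios=[1.0, 1.0], advantages=[-1.0, -1.0]
    )
    print(weights, loss)
```

In this toy call the uncertain token receives the full update and the confident token an attenuated one, yet both still contribute a positive loss because the advantage is negative.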

Related papers

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
Policy Gradient with Adaptive Entropy Annealing for Continual Fine-Tuning [18.440289150575648]
We propose a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. We evaluate various entropy regularization methods and demonstrate that lower entropy of the output prediction distribution enhances adaptation in pretrained vision models.
arXiv Detail & Related papers (2026-02-15T10:05:03Z)
SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning [54.393763477932474]
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). We propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective.
arXiv Detail & Related papers (2026-02-07T09:39:21Z)
GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning [14.111530312590531]
Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI). We propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO). GDEPO incorporates three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, which decouples the sign of the advantage function from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, which apply extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases.
arXiv Detail & Related papers (2026-01-11T07:34:41Z)
ESPO: Entropy Importance Sampling Policy Optimization [7.2000276975120014]
Entropy Importance Sampling Policy Optimization (ESPO) reconciles fine-grained control with training stability. ESPO decomposes sequences into groups based on predictive entropy. Experiments on mathematical reasoning benchmarks demonstrate that ESPO achieves convergence and state-of-the-art performance.
arXiv Detail & Related papers (2025-11-29T14:09:38Z)
From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training [38.8378349968766]
Reinforcement Learning with Verifiable Rewards (RLVR) is highly dependent on high-quality labeled data. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels. We propose a novel two-stage, token-level entropy optimization method for RLVR.
arXiv Detail & Related papers (2025-11-11T01:42:37Z)
Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL [56.085103402298905]
We propose a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first separately learn two Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. We develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements.
arXiv Detail & Related papers (2025-10-25T09:17:47Z)
Agentic Entropy-Balanced Policy Optimization [114.90524574220764]
Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints. We propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases.
arXiv Detail & Related papers (2025-10-16T10:40:52Z)
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning [71.30276778807068]
We propose a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data.
arXiv Detail & Related papers (2025-09-28T13:27:38Z)
CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning [28.02073546326571]
Policy entropy reflects the balance between exploration and exploitation during training. Existing methods discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO).
arXiv Detail & Related papers (2025-09-25T03:22:04Z)
From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature [38.46122853450324]
Existing algorithms apply uniform optimization to all tokens, ignoring their different roles in the reasoning process. We introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that dynamically adapts optimization based on token entropy.
arXiv Detail & Related papers (2025-09-20T09:30:25Z)
Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning [41.83677588934301]
We propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA). BCPG-NSA is a fine-grained offline framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset.
arXiv Detail & Related papers (2025-05-20T14:16:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.