Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization
- URL: http://arxiv.org/abs/2603.04135v1
- Date: Wed, 04 Mar 2026 14:48:53 GMT
- Title: Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization
- Authors: Haodong Zhu, Yangyang Ren, Yanjing Li, Mingbao Lin, Linlin Yang, Xuhui Liu, Xiantong Zhen, Haiguang Liu, Baochang Zhang
- Abstract summary: Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
- Score: 60.87651283510059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs due to its extensive group-based sampling requirement. While recent selective data utilization methods can mitigate this overhead, they could induce estimation bias by altering the underlying sampling distribution, compromising theoretical rigor and convergence behavior. To address this limitation, we propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation through importance sampling-based correction. By incorporating mathematically derived rescaling factors, DPPO significantly accelerates GRPO training without altering the optimization objective of the full-batch baseline. Furthermore, to mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy that maximizes valid token density and hardware utilization. Extensive experiments demonstrate that DPPO consistently accelerates training across diverse models and benchmarks. For instance, on Qwen3-4B trained on MATH, DPPO achieves 2.37$\times$ training speedup and outperforms GRPO by 3.36% in average accuracy across six mathematical reasoning benchmarks.
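A minimal sketch of the two mechanisms the abstract describes, assuming a GRPO-style group of scalar rewards: (1) prune each sampled completion with a known keep probability and rescale the kept terms by its inverse so the pruned estimate stays unbiased in expectation, and (2) greedily pack variable-length prompts into fixed-capacity batches using a small look-ahead window. The keep-probability rule, the window size, and all function names are illustrative assumptions, not DPPO's actual implementation.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style group-relative advantage: reward minus group mean,
    divided by the group standard deviation."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def pruned_unbiased_mean(values, keep_probs, rng=None):
    """Keep term i with probability keep_probs[i] and rescale kept terms
    by 1 / keep_probs[i], so the pruned average equals the full-batch
    average in expectation (a Horvitz-Thompson-style correction)."""
    rng = rng or np.random.default_rng(0)
    values = np.asarray(values, dtype=np.float64)
    keep_probs = np.asarray(keep_probs, dtype=np.float64)
    mask = rng.random(len(values)) < keep_probs
    rescaled = np.where(mask, values / keep_probs, 0.0)
    return rescaled.mean(), int(mask.sum())

def pack_prompts(lengths, capacity, window=4):
    """Window-based greedy packing sketch: repeatedly fill a pack by
    scanning only the next `window` pending sequences (longest first)
    and adding whichever still fits, to reduce padding waste."""
    pending = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    packs = []
    while pending:
        pack, used, i = [], 0, 0
        while i < min(window, len(pending)):
            idx = pending[i]
            if used + lengths[idx] <= capacity:
                pack.append(idx)
                used += lengths[idx]
                pending.pop(i)
            else:
                i += 1
        if not pack:                      # oversized sequence gets its own pack
            pack.append(pending.pop(0))
        packs.append(pack)
    return packs

if __name__ == "__main__":
    rewards = [0.9, 0.1, 0.8, 0.2, 0.45, 0.55, 0.5, 0.05]
    adv = grpo_advantages(rewards)
    keep_probs = np.clip(np.abs(adv), 0.2, 1.0)   # illustrative pruning rule only
    est, kept = pruned_unbiased_mean(adv, keep_probs)
    print(f"full mean {adv.mean():+.3f}  pruned estimate {est:+.3f}  kept {kept}/8")
    print(pack_prompts([900, 820, 400, 350, 300, 120, 100, 60], capacity=1024))
```

Averaged over many random pruning masks, the rescaled estimate matches the full-batch mean, which is the sense in which the correction keeps the gradient estimator unbiased.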
Related papers
- Difficulty-Estimated Policy Optimization [38.86673795561421]
We propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling.
arXiv Detail & Related papers (2026-02-06T04:12:23Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled divergence-based constraint. DPPO achieves superior training performance and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
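A rough PyTorch sketch of the contrast this summary draws: the standard PPO clipped surrogate versus a surrogate that drops clipping and instead adds an explicit divergence penalty between the new and old policies. The specific divergence estimator and coefficient below are illustrative assumptions, not the paper's actual constraint.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate: clamp the probability ratio."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def divergence_penalized_loss(logp_new, logp_old, advantages, beta=0.05):
    """No clipping; instead penalize divergence between new and old policies
    with a non-negative per-sample KL estimator (ratio - 1 - log ratio)."""
    log_ratio = logp_new - logp_old
    ratio = torch.exp(log_ratio)
    kl_est = ratio - 1.0 - log_ratio
    return -(ratio * advantages).mean() + beta * kl_est.mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    logp_old = torch.randn(16)
    logp_new = logp_old + 0.1 * torch.randn(16)
    adv = torch.randn(16)
    print(ppo_clip_loss(logp_new, logp_old, adv).item())
    print(divergence_penalized_loss(logp_new, logp_old, adv).item())
```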
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
Group Relative Policy Optimization is often applied directly to multi-reward settings without examining its suitability. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
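One plausible reading of "reward-decoupled normalization", sketched with NumPy: instead of summing reward components and normalizing the total within the group, normalize each reward component within the group separately and then combine. The equal weighting and the exact normalization are assumptions for illustration; the paper's formulation may differ.

```python
import numpy as np

def coupled_advantages(rewards):
    """Baseline: sum reward components first, then normalize the totals
    within the group (GRPO applied naively to multiple rewards)."""
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_advantages(rewards, weights=None):
    """Normalize each reward component within the group separately, then
    combine, so no single high-variance reward dominates the signal."""
    norm = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    if weights is None:
        weights = np.full(rewards.shape[1], 1.0 / rewards.shape[1])
    return norm @ weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A group of 8 completions scored by 2 reward heads on very different scales.
    rewards = np.stack([rng.normal(0, 1, 8), rng.normal(0, 100, 8)], axis=1)
    print("coupled  :", coupled_advantages(rewards).round(2))
    print("decoupled:", decoupled_advantages(rewards).round(2))
```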
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - OptPO: Optimal Rollout Allocation for Test-time Policy Optimization [11.375209834858135]
Test-time policy optimization enables large language models to adapt to distribution shifts by leveraging feedback from self-generated rollouts. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets.
arXiv Detail & Related papers (2025-12-02T15:38:52Z) - Reinforcing Diffusion Models by Direct Group Preference Optimization [19.195805549362074]
Direct Group Preference Optimization (DGPO) learns directly from group-level preferences, which utilize the relative information of samples within groups. Results show that DGPO trains around 20 times faster than existing state-of-the-art methods while achieving superior performance on both in-domain and out-of-domain reward metrics.
arXiv Detail & Related papers (2025-10-09T16:40:43Z) - Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization [0.0]
Margin-Adaptive Direct Preference Optimization (MADPO) provides a stable, data-preserving, and instance-level solution. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape. It achieves performance gains of up to +33.3% on High Quality data and +10.5% on Low Quality data over the next-best method.
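A sketch of where an instance-level, reward-model-derived weight could enter a standard DPO loss. The direction and shape of the margin-to-weight mapping below are pure assumptions for illustration; the paper's actual weighting scheme is not reproduced here.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, weights=None):
    """Standard DPO loss with an optional per-instance weight applied to
    each preference pair before averaging."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    losses = -F.logsigmoid(margin)
    if weights is not None:
        losses = weights * losses
    return losses.mean()

def margin_to_weight(rm_chosen, rm_rejected, temperature=1.0):
    """Hypothetical mapping from a reward-model score gap to a weight in
    (0, 2); here pairs with a smaller (harder) gap get larger weight.
    Whether harder or easier pairs should be up-weighted is an assumption."""
    gap = rm_chosen - rm_rejected
    return 2.0 * torch.sigmoid(-gap / temperature)
```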
arXiv Detail & Related papers (2025-10-06T20:09:37Z) - Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A$*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2$\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models [77.16976971950785]
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models. CPPO prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Experiments show that CPPO achieves up to $7.98\times$ speedup on GSM8K and $3.48\times$ on MATH while preserving or even enhancing the accuracy compared to the original GRPO.
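A small NumPy sketch of the pruning rule this summary describes: compute GRPO-style group-relative advantages, then keep only the completions with the largest absolute advantage before running the expensive gradient computation. The keep ratio is an illustrative knob, not the paper's setting.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style group-relative advantage."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def prune_completions(rewards, keep_ratio=0.5):
    """Keep only the completions with the largest |advantage|; the rest
    carry little gradient signal and are dropped before the backward pass."""
    adv = group_advantages(rewards)
    k = max(1, int(np.ceil(keep_ratio * len(adv))))
    keep_idx = np.sort(np.argsort(-np.abs(adv))[:k])
    return keep_idx, adv

if __name__ == "__main__":
    rewards = [0.9, 0.1, 0.8, 0.2, 0.5, 0.6, 0.4, 0.55]
    idx, adv = prune_completions(rewards, keep_ratio=0.5)
    print("advantages:", adv.round(2))
    print("kept completion indices:", idx.tolist())
```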
arXiv Detail & Related papers (2025-03-28T11:30:05Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
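A compact sketch of the combined objective the summary names, i.e. a preference-optimization loss plus a supervised (SFT) log-likelihood term on the chosen responses. The DPO-style preference term and the coefficient are illustrative stand-ins, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def regularized_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                                beta=0.1, sft_coef=1.0):
    """Preference-optimization loss plus an SFT term on chosen responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    preference = -F.logsigmoid(margin).mean()
    sft = -logp_w.mean()              # maximize likelihood of chosen responses
    return preference + sft_coef * sft

if __name__ == "__main__":
    torch.manual_seed(0)
    lw, ll = torch.randn(8), torch.randn(8) - 0.5
    print(regularized_preference_loss(lw, ll, 0.9 * lw, 0.9 * ll).item())
```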
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.