CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
- URL: http://arxiv.org/abs/2503.22342v1
- Date: Fri, 28 Mar 2025 11:30:05 GMT
- Title: CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
- Authors: Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji
- Abstract summary: This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models. We show that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO.
- Score: 68.26281707780761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need for sampling multiple completions for each question. Our experiments and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and that not all completions contribute equally to policy training -- their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experimental results demonstrate that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
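A minimal sketch may help make the pruning idea concrete. This is not the released implementation (see the repository above for that); the tensor layout, the `keep_ratio` parameter, and all function names below are assumptions used only to illustrate group-relative advantages and pruning by absolute advantage.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize rewards within each question's
    group of sampled completions (rewards has shape [num_questions, group_size])."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def prune_by_abs_advantage(advantages: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the completions with the largest |advantage| in each group;
    the rest are excluded from the policy-gradient computation and update."""
    k = max(1, int(advantages.size(1) * keep_ratio))
    return advantages.abs().topk(k, dim=1).indices

# Example: 2 questions, 8 completions each, binary correctness rewards.
rewards = torch.tensor([[1., 0., 0., 1., 0., 0., 0., 1.],
                        [0., 0., 1., 0., 0., 0., 0., 0.]])
keep = prune_by_abs_advantage(grpo_advantages(rewards))  # 4 of 8 kept per question
# A dynamic allocation strategy would then refill the freed batch slots with
# retained completions from additional questions to keep GPU utilization high.
```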
Related papers
- GTPO: Trajectory-Based Policy Optimization in Large Language Models [45.799380822683034]
Policy-based optimizations are widely adopted today for the training and alignment of language models. In this paper, we reveal and analyze two major limitations of GRPO. We introduce GTPO, which identifies conflict tokens: tokens appearing in the same position across completions with opposite rewards.
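As a rough illustration of the conflict-token idea in this summary (a sketch inferred from the sentence above, not GTPO's actual procedure; all names are hypothetical), one could flag tokens that occur at the same position in both positively and negatively rewarded completions:

```python
from typing import List, Set, Tuple

def find_conflict_tokens(completions: List[List[int]],
                         rewards: List[float]) -> Set[Tuple[int, int]]:
    """Return (position, token_id) pairs that appear at the same position in at
    least one positively rewarded and one negatively rewarded completion."""
    positive, negative = set(), set()
    for tokens, reward in zip(completions, rewards):
        bucket = positive if reward > 0 else negative
        for position, token_id in enumerate(tokens):
            bucket.add((position, token_id))
    return positive & negative

# Token 17 at position 1 (and token 5 at position 0) receive opposite signals.
completions = [[5, 17, 9], [5, 17, 3], [2, 8, 9]]
rewards = [1.0, -1.0, 1.0]
print(find_conflict_tokens(completions, rewards))  # {(0, 5), (1, 17)} (order may vary)
```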
arXiv Detail & Related papers (2025-08-05T08:15:01Z) - Truncated Proximal Policy Optimization [43.965892659920364]
Truncated Proximal Policy Optimization (T-PPO) improves training efficiency by streamlining policy updates and length-restricted response generation. We propose Extended Generalized Advantage Estimation (EGAE) for advantage estimation derived from incomplete responses. We demonstrate the effectiveness and efficiency of T-PPO on AIME 2024 with a 32B base model.
arXiv Detail & Related papers (2025-06-18T01:21:38Z) - Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. $A^*$-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to $2\times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a Reinforce-like algorithm perspective and analyze its core components.
We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO.
Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
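As a quick sketch of this filtering idea (based only on the summary above, not the paper's implementation; the names and the binary-reward assumption are illustrative), prompt groups whose completions are all correct or all incorrect provide no contrast for a baseline-subtracted gradient and can simply be dropped:

```python
from typing import List, Tuple

Group = Tuple[List[str], List[float]]  # (completions, binary rewards) for one prompt

def filter_degenerate_groups(groups: List[Group]) -> List[Group]:
    """Drop prompt groups whose sampled completions are entirely correct or
    entirely incorrect; only mixed groups contribute a useful gradient."""
    return [
        (completions, rewards)
        for completions, rewards in groups
        if 0.0 < sum(rewards) / len(rewards) < 1.0
    ]

groups = [
    (["c1", "c2"], [1.0, 1.0]),  # entirely correct   -> filtered out
    (["c3", "c4"], [0.0, 0.0]),  # entirely incorrect -> filtered out
    (["c5", "c6"], [1.0, 0.0]),  # mixed              -> kept
]
print(len(filter_degenerate_groups(groups)))  # 1
```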
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret [3.410112345043215]
We propose Value-Calibrated PPO (VC-PPO) to address these issues. Experiments on the American Invitational Mathematics Examination (AIME) show that VC-PPO significantly boosts PPO performance.
arXiv Detail & Related papers (2025-03-03T12:59:25Z) - Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. We propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM).
arXiv Detail & Related papers (2025-02-24T08:11:33Z) - VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - A dynamical clipping approach with task feedback for Proximal Policy Optimization [29.855219523565786]
There is no theoretical proof that the optimal PPO clipping bound remains consistent throughout the entire training process.
Past studies have aimed to dynamically adjust the PPO clipping bound to enhance PPO's performance.
We propose Preference-based Proximal Policy Optimization (Pb-PPO) to better reflect the preference (maximizing return) of reinforcement learning tasks.
arXiv Detail & Related papers (2023-12-12T06:35:56Z) - You May Not Need Ratio Clipping in PPO [117.03368180633463]
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data.
Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples.
We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios.
We show that the proposed alternative, ESPO, can be easily scaled up to distributed training with many workers, delivering strong performance as well.
arXiv Detail & Related papers (2022-01-31T20:26:56Z)
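For reference, the ratio-clipping objective that this last entry calls into question is the standard PPO surrogate; in its usual textbook form (reproduced here for context, not taken from that paper):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
$$

Clipping removes the incentive to push $r_t(\theta)$ outside $[1-\epsilon, 1+\epsilon]$ within an update, but it does not hard-bound the ratios themselves, which is exactly the failure mode the entry above points to.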
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.