Related papers: Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward

Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward

URL: http://arxiv.org/abs/2506.05433v1
Date: Thu, 05 Jun 2025 09:13:37 GMT
Title: Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
Authors: Zikang Liu, Tongtian Yue, Yepeng Tang, Longteng Guo, Junxian Cai, Qingbin Liu, Xi Chen, Jing Liu,
Abstract summary: We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefixes via a Shared-Prefix Forward strategy.<n>By restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once.<n>We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO.
Score: 10.640867597958863
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Group Relative Policy Optimization (GRPO) enhances policy learning by computing gradients from relative comparisons among candidate outputs that share a common input prefix. Despite its effectiveness, GRPO introduces substantial computational overhead when processing long shared prefixes, which must be redundantly encoded for each group member. This inefficiency becomes a major scalability bottleneck in long-context learning scenarios. We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy. In particular, by restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once, while preserving full differentiability and compatibility with end-to-end training. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO: it yields identical forward outputs and backward gradients, ensuring that the optimization dynamics and final policy performance remain unchanged. Empirically, our experiments confirm that Prefix Grouper achieves consistent results while significantly reducing the computational cost of training, particularly in long-prefix scenarios. The proposed method is fully plug-and-play: it is compatible with existing GRPO-based architectures and can be seamlessly integrated into current training pipelines as a drop-in replacement, requiring no structural modifications and only minimal changes to input construction and attention computation. Prefix Grouper enables the use of larger group sizes under the same computational budget, thereby improving the scalability of GRPO to more complex tasks and larger models. Code is now available at https://github.com/johncaged/PrefixGrouper

Related papers

Group Sequence Policy Optimization [55.40088895148603]
Group Sequence Policy Optimization (GSPO) is a stable, efficient, and performant reinforcement learning algorithm.<n>GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.
arXiv Detail & Related papers (2025-07-24T03:50:32Z)
Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models [9.805174094639785]
Group-based reinforcement learning algorithms have proven effective for fine-tuning large language models (LLMs) with human feedback.<n> generating and storing multiple responses per prompt incurs substantial memory overhead.<n>We propose Infinite Sampling, a framework that enables efficient and stable GRPO training by decoupling group size from GPU memory usage.
arXiv Detail & Related papers (2025-06-28T16:52:29Z)
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning [11.708197376569016]
Group Relative Policy Optimization ( GRPO) is proposed to compute the advantage for each output by subtracting the mean reward, as the baseline, for all outputs in the group.<n>It can lead to inaccurate advantage estimates in environments with highly noisy rewards, potentially introducing bias.<n>We propose a model, called Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), by using lightweight Kalman filtering to dynamically estimate the latent reward mean and variance.
arXiv Detail & Related papers (2025-05-12T13:09:49Z)
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models [68.26281707780761]
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models.<n>We show that CPPO achieves up to $8.32times$ speedup on GSM8K and $3.51times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO.
arXiv Detail & Related papers (2025-03-28T11:30:05Z)
Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization [4.158255103170876]
GFlowNets are a family of generative models that learn to sample objects proportional to a given reward function.<n>Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning problems.<n>We introduce a simple backward policy optimization algorithm that involves direct sequentially of the value function in an entropy-regularized Markov Decision Process.
arXiv Detail & Related papers (2024-10-20T19:12:14Z)
REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.<n>In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.<n>We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation [121.0693322732454]
Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficient fine-tuning methods to enhance CLIP's performance in downstream tasks. We revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP.
arXiv Detail & Related papers (2024-02-06T15:45:27Z)
BatchGFN: Generative Flow Networks for Batch Active Learning [80.73649229919454]
BatchGFN is a novel approach for pool-based active learning that uses generative flow networks to sample sets of data points proportional to a batch reward. We show our approach enables principled sampling near-optimal utility batches at inference time with a single forward pass per point in the batch in toy regression problems.
arXiv Detail & Related papers (2023-06-26T20:41:36Z)
Towards Learning Universal Hyperparameter Optimizers with Transformers [57.35920571605559]
We introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction. Our experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates.
arXiv Detail & Related papers (2022-05-26T12:51:32Z)
Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction [19.08180531016811]
We develop a novel framework that adds regularizers of the sparse group lasso to a family of adaptives in deep learning.<n>We establish proven convergence guarantees in the theoretically convex settings.<n>Our methods can achieve extremely high sparsity with significantly better or highly competitive performance.
arXiv Detail & Related papers (2021-07-30T05:33:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.