Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning
- URL: http://arxiv.org/abs/2503.15952v1
- Date: Thu, 20 Mar 2025 08:48:57 GMT
- Title: Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning
- Authors: Chen Li, Nazhou Liu, Kai Yang
- Abstract summary: We propose Adaptive Group Policy Optimization (AGPO), which contains two simple but effective modifications. Experiments demonstrate that our method achieves more stable training and comparable or superior performance with significantly fewer tokens in reasoning steps.
- Score: 4.325768677318839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since being popularized by DeepSeek-R1, Group Relative Policy Optimization (GRPO) has become a core component of reasoning-LLM training. However, we identify deficiencies that affect RL stability and inference efficiency. We therefore propose Adaptive Group Policy Optimization (AGPO), which contains two simple but effective modifications: a revised advantage estimation method that mitigates zero-variance situations, and a length-based reward that incentivizes the model to avoid overthinking. Experiments demonstrate that our method achieves more stable training and comparable or superior performance with significantly fewer tokens in its reasoning steps.
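The abstract names the two modifications without giving formulas, so the following is a minimal PyTorch-style sketch of how they could plug into a GRPO-style pipeline: a group-normalized advantage with a fallback for zero-variance groups, and a correctness reward shaped by a length penalty. The function names, the zero-advantage fallback rule, and the linear penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages with a guard for zero-variance groups.

    rewards: shape (num_groups, group_size), one row of sampled-response
    rewards per prompt. Standard GRPO normalizes by the group std, which is
    undefined (or explodes) when every sample in a group receives the same
    reward; here such groups simply get zero advantage -- one plausible
    reading of "mitigating zero-variance situations".
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    adv = (rewards - mean) / (std + eps)
    zero_var = std < eps  # groups where all rewards are equal
    return torch.where(zero_var, torch.zeros_like(adv), adv)

def length_shaped_reward(correct: torch.Tensor, lengths: torch.Tensor,
                         target_len: float = 512.0, alpha: float = 0.1) -> torch.Tensor:
    """Correctness reward minus a mild penalty for overlong reasoning traces.

    correct: 0/1 correctness per response; lengths: token counts. The linear
    over-length penalty is only an illustration of a "length-based reward";
    AGPO's actual shaping function is defined in the paper.
    """
    penalty = alpha * torch.clamp(lengths / target_len - 1.0, min=0.0)
    return correct.float() - penalty
```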
Related papers
- GVPO: Group Variance Policy Optimization for Large Language Model Post-Training [18.431007107428574]
Group Variance Policy Optimization (GVPO) incorporates the analytical solution to KL-constrained reward directly into its weights.
GVPO offers two key advantages: it guarantees a unique optimal solution that is exactly the optimum of the KL-constrained reward objective, and it supports flexible sampling distributions.
By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
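For reference, the "analytical solution to the KL-constrained reward" objective is conventionally the exponential-tilting closed form below, with β the KL coefficient and Z(x) the partition function; exactly how GVPO folds this solution into its per-sample weights is specified in the paper itself, so this is context rather than the method.

```latex
% Maximizer of  E_{y ~ pi}[ r(x,y) ] - beta * KL( pi(.|x) || pi_ref(.|x) )
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big).
```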
arXiv Detail & Related papers (2025-04-28T09:02:24Z)
- A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components.
We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO.
Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
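As a rough illustration of this filtering idea (not the authors' code), the snippet below drops prompts whose sampled responses are all correct or all incorrect before a REINFORCE-style update, since such groups carry no contrastive signal; the function name and 0/1 reward convention are assumptions.

```python
from typing import List

def filter_mixed_groups(groups: List[List[int]]) -> List[List[int]]:
    """Keep only prompts whose sampled responses have mixed 0/1 correctness.

    Entirely correct or entirely incorrect groups are discarded, mimicking the
    Reinforce-Rej idea of filtering both kinds of degenerate samples
    (illustrative sketch only).
    """
    return [r for r in groups if 0 < sum(r) < len(r)]

# The all-correct and all-incorrect prompts are dropped; the mixed one is kept.
print(filter_mixed_groups([[1, 1, 1, 1], [1, 0, 0, 1], [0, 0, 0, 0]]))  # [[1, 0, 0, 1]]
```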
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
- Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models.
Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process.
We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
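As a toy sketch of what searching the task-embedding space might look like (the paper's actual adaptation strategies are more sophisticated), the snippet below runs a simple Gaussian perturbation search over a latent z, assuming a user-supplied `evaluate_return(z)` that rolls out the BFM policy conditioned on z and returns its episodic return.

```python
from typing import Callable, Optional
import numpy as np

def search_task_embedding(evaluate_return: Callable[[np.ndarray], float],
                          dim: int = 8, num_candidates: int = 64,
                          sigma: float = 0.5,
                          z_init: Optional[np.ndarray] = None) -> np.ndarray:
    """Black-box random search over a low-dimensional task embedding.

    Keeps the best embedding found so far; purely a stand-in for the fast
    adaptation strategies studied in the paper.
    """
    z = np.zeros(dim) if z_init is None else z_init
    best = evaluate_return(z)
    for _ in range(num_candidates):
        cand = z + sigma * np.random.randn(dim)
        ret = evaluate_return(cand)
        if ret > best:
            z, best = cand, ret
    return z
```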
arXiv Detail & Related papers (2025-04-10T16:14:17Z)
- PEO: Improving Bi-Factorial Preference Alignment with Post-Training Policy Extrapolation [5.347428263669927]
Post-Training Extrapolation Optimization (PEO) is a novel and efficient framework for bi-factorial alignment. PEO generates a family of optimal policies in a single training pass by leveraging a three-phase pipeline.
arXiv Detail & Related papers (2025-03-03T06:56:39Z)
- Teaching LLMs to Refine with Tools [68.23479664749271]
Large language models (LLMs) can refine their responses based on feedback, enabling self-improvement through iterative training or test-time refinement. We propose CaP, a novel approach that uses external tools to refine chain-of-thought (CoT) responses generated by the same or other LLMs.
arXiv Detail & Related papers (2024-12-22T05:43:50Z)
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) that jointly trains the reward model and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- Group Robust Preference Optimization in Reward-free RLHF [23.622835830345725]
We propose a novel Group Robust Preference Optimization (GRPO) method to align large language models to individual groups' preferences robustly.
To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss.
We significantly improve performance for the worst-performing groups, reduce loss imbalance across groups, and improve probability accuracy.
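The "prioritizing groups with worse cumulative loss" step is typically realized with an exponentiated (mirror-ascent) weight update over the group simplex; the sketch below shows that generic group-DRO-style rule, which may differ in detail from the paper's actual update.

```python
import numpy as np

def update_group_weights(weights: np.ndarray, group_losses: np.ndarray,
                         step_size: float = 0.1) -> np.ndarray:
    """Exponentiated reweighting that up-weights groups with larger loss.

    weights: current simplex weights over groups; group_losses: latest
    per-group (or cumulative) losses. Groups doing worse receive
    exponentially more weight; the result is renormalized to the simplex.
    """
    new_w = weights * np.exp(step_size * group_losses)
    return new_w / new_w.sum()

# Example: the second group has the larger loss and gains weight.
print(update_group_weights(np.array([0.5, 0.5]), np.array([0.2, 1.0])))
```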
arXiv Detail & Related papers (2024-05-30T17:50:04Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
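Read schematically, such an objective adds an SFT (maximum-likelihood) term on the preferred response to a preference-optimization loss; the form below, with weighting coefficient η and preference pairs (x, y_w, y_l), is a generic rendering rather than the paper's exact statement.

```latex
\mathcal{L}(\theta) \;=\;
\underbrace{\mathbb{E}_{(x,\, y_w,\, y_l)}\big[\, \ell_{\mathrm{pref}}(\theta;\, x,\, y_w,\, y_l) \,\big]}_{\text{preference optimization loss}}
\;+\; \eta\;
\underbrace{\mathbb{E}_{(x,\, y_w)}\big[\, -\log \pi_{\theta}(y_w \mid x) \,\big]}_{\text{supervised learning (SFT) loss}}
```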
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- DPO: Differential reinforcement learning with application to optimal configuration search [3.2857981869020327]
Reinforcement learning with continuous state and action spaces remains one of the most challenging problems within the field.
We propose the first differential RL framework that can handle settings with limited training samples and short-length episodes.
arXiv Detail & Related papers (2024-04-24T03:11:12Z)
- Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
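One way to read "control score margins" with these two priors is to turn the length ratio and the cosine similarity of a comparison pair into a target margin for the pairwise reward-model loss; the sketch below is a guess at such a rule (the margin formula, names, and scaling are assumptions, not the paper's definition).

```python
import torch
import torch.nn.functional as F

def margin_constrained_rm_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor,
                               len_chosen: torch.Tensor, len_rejected: torch.Tensor,
                               emb_chosen: torch.Tensor, emb_rejected: torch.Tensor,
                               scale: float = 1.0) -> torch.Tensor:
    """Pairwise reward-model loss with a prior-derived target margin.

    The margin shrinks for near-duplicate pairs (similar lengths, high cosine
    similarity) and grows for dissimilar ones, so very different outputs must
    be separated by a larger score gap. Illustrative only.
    """
    length_ratio = torch.minimum(len_chosen, len_rejected) / torch.maximum(len_chosen, len_rejected)
    cos_sim = F.cosine_similarity(emb_chosen, emb_rejected, dim=-1)
    margin = scale * (1.0 - 0.5 * (length_ratio + cos_sim))  # larger for dissimilar pairs
    return -F.logsigmoid(score_chosen - score_rejected - margin).mean()
```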
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.