Related papers: AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum

AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum

URL: http://arxiv.org/abs/2505.14264v1
Date: Tue, 20 May 2025 12:13:44 GMT
Title: AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum
Authors: Jian Xiong, Jingbo Zhou, Jingyong Ye, Dejing Dou,
Abstract summary: Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs)<n>Group relative advantage estimation has attracted considerable attention for eliminating the dependency on the value model.<n>We propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimize the cross-entropy loss using advantages enhanced through a momentum-based estimation scheme.
Score: 45.135858299101386
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that exsiting group relative advantage estimation method still suffers from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a momentum-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks demonstrate the superior performance of AAPO.

Related papers

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [55.06360285372418]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs)<n>In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias.<n>We introduce a new Discriminative Constrained Optimization framework for reinforcing LRMs, grounded in the principle of discriminative learning.
arXiv Detail & Related papers (2025-05-18T11:08:32Z)
Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning [4.325768677318839]
We propose Adaptive Group Policy Optimization (AGPO) which contains two simple but effective modifications.<n>The experiments demonstrate our methods achieve more stable training and comparable or superior performance with significantly fewer tokens in reasoning steps.
arXiv Detail & Related papers (2025-03-20T08:48:57Z)
Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning [8.087699764574788]
We propose an efficient algorithm for offline preference-based reinforcement learning (PbRL)<n>APPO guarantees sample complexity bounds without relying on explicit confidence sets.<n>To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability.
arXiv Detail & Related papers (2025-03-07T10:35:01Z)
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning [61.403275660120606]
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives.<n>We propose leave-one-out PPO (LOOP), a novel RL for diffusion fine-tuning method.<n>Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
arXiv Detail & Related papers (2025-03-02T13:43:53Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem. PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins. Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment [37.52249093928251]
This paper proposes a new framework, reinforcement learning with relative feedback, and a novel trajectory-wise policy gradient algorithm. We show theoretically that P3O is invariant to equivalent rewards and avoids the complexity of PPO. Empirical evaluations demonstrate that P3O outperforms PPO in the KL-Reward trade-off and can align with human preferences as well as or better than prior methods.
arXiv Detail & Related papers (2023-09-30T01:23:22Z)
Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences. We show that it consistently outperforms PPO in language tasks by a large margin. We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.