MAPO: Mixed Advantage Policy Optimization
- URL: http://arxiv.org/abs/2509.18849v3
- Date: Thu, 25 Sep 2025 01:02:44 GMT
- Title: MAPO: Mixed Advantage Policy Optimization
- Authors: Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao
- Abstract summary: We propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that trajectories are generated with different degrees of certainty and propose the advantage percent deviation for samples with high-certainty trajectories.
- Score: 120.96975697212065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose a simple but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that trajectories are generated with different degrees of certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.
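To make the mechanism concrete, below is a minimal Python sketch of a group-relative advantage and a certainty-weighted mix of two advantage variants. The percent-deviation form, the certainty proxy, and the mixing rule are illustrative assumptions, not MAPO's exact formulation.

```python
import numpy as np

def grpo_advantage(rewards):
    """Standard group-relative advantage: z-score of rewards within a group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def percent_deviation_advantage(rewards):
    """Illustrative 'advantage percent deviation': deviation relative to the group mean
    rather than the group standard deviation (assumed form, not the paper's exact one)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (np.abs(r.mean()) + 1e-8)

def mixed_advantage(rewards, correct_mask):
    """MAPO-style mixing (sketch): weight the two advantage variants by trajectory
    certainty, here proxied by the fraction of correct rollouts in the group."""
    certainty = float(np.mean(correct_mask))          # in [0, 1]; assumed proxy
    a_std = grpo_advantage(rewards)
    a_pct = percent_deviation_advantage(rewards)
    return certainty * a_pct + (1.0 - certainty) * a_std

# Example: 6 rollouts for one query, binary correctness rewards.
rewards = [1, 1, 1, 1, 0, 1]
print(mixed_advantage(rewards, correct_mask=[r == 1 for r in rewards]))
```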
Related papers
- SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning [50.93295951454092]
We introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage-shaping term for policy optimization. Experiments across a range of model scales demonstrate the effectiveness of the proposed algorithm, which consistently outperforms strong baselines in both Pass@1 and Pass@K across various benchmarks.
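A rough sketch of a leave-one-out, kernel-based diversity bonus of the kind SetPO describes; the trajectory embeddings, RBF kernel, diversity definition, and scaling constant are placeholder assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Pairwise RBF similarity between trajectory embeddings (rows of X)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def set_diversity(K):
    """Set-level diversity of a group: 1 minus the average pairwise similarity (assumed form)."""
    n = K.shape[0]
    off_diag = (K.sum() - np.trace(K)) / (n * (n - 1))
    return 1.0 - off_diag

def leave_one_out_bonus(embeddings, gamma=1.0, scale=0.1):
    """Each trajectory's marginal contribution to group diversity, usable as a
    plug-in advantage-shaping term (sketch, not SetPO's exact objective)."""
    K = rbf_kernel(embeddings, gamma)
    full = set_diversity(K)
    bonuses = []
    for i in range(len(embeddings)):
        keep = [j for j in range(len(embeddings)) if j != i]
        bonuses.append(full - set_diversity(K[np.ix_(keep, keep)]))
    return scale * np.array(bonuses)

# Example: 4 trajectories with 8-dim embeddings; shaped advantage = base advantage + bonus.
emb = np.random.default_rng(0).normal(size=(4, 8))
print(leave_one_out_bonus(emb))
```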
arXiv Detail & Related papers (2026-02-01T07:13:20Z)
- AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards [60.2998874976509]
We propose advantage-weighted policy optimization (AWPO) to integrate explicit reasoning rewards and enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals. Experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks.
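A hedged sketch of how variance-aware gating and difficulty-aware weighting could modulate a reasoning-reward advantage; the gate threshold and difficulty proxy below are assumptions, not AWPO's exact design.

```python
import numpy as np

def gated_weighted_advantage(outcome_rewards, reasoning_rewards, gate_tau=0.5):
    """Illustrative variance-aware gating and difficulty-aware weighting of a
    reasoning-reward advantage (assumed forms; AWPO's exact definitions may differ)."""
    o = np.asarray(outcome_rewards, dtype=float)   # e.g. tool-call success in {0, 1}
    r = np.asarray(reasoning_rewards, dtype=float) # e.g. a judge score for the reasoning trace

    base_adv = o - o.mean()                        # group-relative outcome advantage
    reason_adv = r - r.mean()

    # Variance-aware gate: only trust the reasoning signal when it actually
    # discriminates between rollouts in this group.
    gate = 1.0 if r.std() > gate_tau * (np.abs(r).mean() + 1e-8) else 0.0

    # Difficulty-aware weight: lean more on reasoning rewards when the query is hard,
    # i.e. when few rollouts succeed.
    difficulty = 1.0 - o.mean()

    return base_adv + gate * difficulty * reason_adv

print(gated_weighted_advantage([1, 0, 0, 1], [0.9, 0.2, 0.4, 0.8]))
```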
arXiv Detail & Related papers (2025-12-22T08:07:00Z)
- APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport [37.21695864040979]
The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning. This paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism.
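For context, a minimal Bradley-Terry reward-model loss with an additive margin; the adaptive-margin rule shown is a simple placeholder, whereas APLOT derives its margin via optimal transport.

```python
import numpy as np

def bt_loss_with_margin(r_chosen, r_rejected, margin=0.0):
    """Bradley-Terry reward-model loss with an additive margin:
    mean of -log sigmoid(r_chosen - r_rejected - margin), written as softplus(-z)."""
    z = np.asarray(r_chosen) - np.asarray(r_rejected) - margin
    return float(np.mean(np.log1p(np.exp(-z))))

def adaptive_margin(r_chosen, r_rejected, base=0.5):
    """Placeholder adaptive margin: scale a base margin by each pair's current reward gap
    (APLOT instead derives its margin via optimal transport)."""
    gap = np.abs(np.asarray(r_chosen) - np.asarray(r_rejected))
    return base * gap / (gap.mean() + 1e-8)

r_chosen = np.array([1.2, 0.3, 0.9])
r_rejected = np.array([0.4, 0.1, 1.0])
print(bt_loss_with_margin(r_chosen, r_rejected, margin=adaptive_margin(r_chosen, r_rejected)))
```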
arXiv Detail & Related papers (2025-10-13T03:13:28Z)
- PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning [6.050409262589219]
We propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts during training. It not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
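A small sketch of an advantage computed against a pre-estimated reference anchor instead of the within-group mean; the pre-sampling procedure and anchor definition below are assumptions based on the summary above.

```python
import numpy as np

def pre_estimated_baseline(reference_rewards):
    """Pre-sampling step: estimate a per-query value anchor from rollouts of a
    fixed reference policy before training starts (assumed procedure)."""
    return float(np.mean(reference_rewards))

def anchored_advantage(rewards, value_anchor):
    """Advantage against the pre-estimated anchor rather than the within-group mean,
    reducing dependence on the number of rollouts drawn per training step."""
    r = np.asarray(rewards, dtype=float)
    return r - value_anchor

# Example: the anchor comes from 16 pre-sampled reference rollouts; training uses only 4.
anchor = pre_estimated_baseline(np.random.default_rng(1).binomial(1, 0.4, size=16))
print(anchored_advantage([1, 0, 1, 1], anchor))
```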
arXiv Detail & Related papers (2025-08-28T09:18:26Z)
- Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO [51.22869332661607]
We decompose the performance gap between reinforcement learning from human feedback and direct preference optimization under a representation gap. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification.
arXiv Detail & Related papers (2025-05-26T09:54:02Z)
- AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum [45.135858299101386]
Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs). Group relative advantage estimation has attracted considerable attention for eliminating the dependency on the value model. We propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy loss using advantages enhanced through a momentum-based estimation scheme.
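An illustrative momentum-style advantage estimator, here an exponential moving average of the per-query group-mean reward used as the baseline; this is an assumed scheme, not necessarily AAPO's exact formulation.

```python
import numpy as np

class MomentumAdvantage:
    """Illustrative momentum-based advantage estimation: keep an exponential moving
    average of each query's group-mean reward and use it as the baseline."""
    def __init__(self, beta=0.9):
        self.beta = beta
        self.baseline = {}   # query id -> EMA of group-mean reward

    def __call__(self, query_id, rewards):
        r = np.asarray(rewards, dtype=float)
        prev = self.baseline.get(query_id, r.mean())
        ema = self.beta * prev + (1.0 - self.beta) * r.mean()
        self.baseline[query_id] = ema
        return r - ema

adv = MomentumAdvantage(beta=0.9)
for step in range(3):                       # same query revisited across training steps
    print(adv("q42", rewards=[1, 0, 1, 1]))
```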
arXiv Detail & Related papers (2025-05-20T12:13:44Z)
- Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural optimization, enabling models to learn to solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces. We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
arXiv Detail & Related papers (2025-05-13T16:47:00Z)
- Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning [16.10753846850319]
Continual learning allows models to persistently acquire and retain information. However, catastrophic forgetting can severely impair model performance. We introduce a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which imposes penalties on representation alterations.
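For reference, a minimal squared-MMD penalty on representation drift with an RBF kernel; the optimal weighting that gives OWMMD its name is omitted, and the feature shapes below are placeholders.

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy between two sets of representations,
    using an RBF kernel (biased estimator)."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * d2)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

# Sketch of an MMD-based penalty on representation drift: compare features of the
# current model with features stored before the new task.
old_feats = np.random.default_rng(0).normal(size=(32, 16))
new_feats = old_feats + 0.1 * np.random.default_rng(1).normal(size=(32, 16))
print(mmd2(old_feats, new_feats, gamma=0.5))
```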
arXiv Detail & Related papers (2025-01-21T13:33:45Z)
- Meta-Learning Objectives for Preference Optimization [39.15940594751445]
We show that it is possible to gain insights into the efficacy of preference optimization algorithms on simpler benchmarks. We propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO).
arXiv Detail & Related papers (2024-11-10T19:11:48Z)
- Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z)