GAPO: Group Adaptive Policy Optimization for Real-World Code Edit
- URL: http://arxiv.org/abs/2510.21830v1
- Date: Wed, 22 Oct 2025 03:37:49 GMT
- Title: GAPO: Group Adaptive Policy Optimization for Real-World Code Edit
- Authors: Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao,
- Abstract summary: Group Adaptive Policy Optimization (GAPO) finds an outlier-free highest-density interval (HDI) per prompt and uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation.<n>GAPO robustly handles skewed distributions while remaining plug-and-play and efficient.<n>We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks.
- Score: 18.191276089029607
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs.<n>We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation.<n>To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions.<n>We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts.<n>Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization [97.18886232580131]
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration.<n>We propose Turn-Level GRPO, a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization.
arXiv Detail & Related papers (2026-01-23T06:21:33Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - OptPO: Optimal Rollout Allocation for Test-time Policy Optimization [11.375209834858135]
Test-time policy optimization enables large language models to adapt to distribution shifts by leveraging feedback from self-generated rollouts.<n>We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets.
arXiv Detail & Related papers (2025-12-02T15:38:52Z) - Reinforcing Diffusion Models by Direct Group Preference Optimization [19.195805549362074]
Group Preference Optimization (DGPO) learns directly from group-level preferences, which utilize relative information of samples within groups.<n>Results show that DGPO trains around 20 times faster than existing state-of-the-art methods and superior performance on both in-domain and out-of-domain metrics reward.
arXiv Detail & Related papers (2025-10-09T16:40:43Z) - Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes [55.2480439325792]
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics.<n>Here, we examine if current RL methods are also effective at optimizing language models in verifiable domains with outcomes, like scientific experiments.
arXiv Detail & Related papers (2025-08-15T20:50:53Z) - Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward [10.640867597958863]
We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefixes via a Shared-Prefix Forward strategy.<n>By restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once.<n>We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO.
arXiv Detail & Related papers (2025-06-05T09:13:37Z) - DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data [65.09939942413651]
We propose a principled extension to GRPO that addresses inter-group imbalance with two key innovations.<n> Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence.<n>Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value.
arXiv Detail & Related papers (2025-05-21T03:43:29Z) - Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning [11.157278744897427]
Group Relative Policy Optimization ( GRPO) was proposed to compute the advantage for each output by subtracting the mean reward, as the baseline, for all outputs in the group.<n>We show that using a more adaptive advantage estimation model, KRPO can improve the stability and performance of GRPO.
arXiv Detail & Related papers (2025-05-12T13:09:49Z) - AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization [45.46582930202524]
$alpha$-DPO is an adaptive preference optimization algorithm for large language models.<n>It balances the policy model and the reference model to achieve personalized reward margins.<n>It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - Better Practices for Domain Adaptation [62.70267990659201]
Domain adaptation (DA) aims to provide frameworks for adapting models to deployment data without using labels.
Unclear validation protocol for DA has led to bad practices in the literature.
We show challenges across all three branches of domain adaptation methodology.
arXiv Detail & Related papers (2023-09-07T17:44:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.