GOPO: Policy Optimization using Ranked Rewards
- URL: http://arxiv.org/abs/2602.03876v1
- Date: Sun, 01 Feb 2026 22:07:11 GMT
- Title: GOPO: Policy Optimization using Ranked Rewards
- Authors: Kyuseong Choi, Dwaipayan Saha, Woojeong Kim, Anish Agarwal, Raaz Dwivedi,
- Abstract summary: Group Ordinal Policy Optimization (GOPO) uses only the ranking of the rewards and discards their magnitudes.<n>We demonstrate consistent improvements across a range of tasks and model sizes.
- Score: 12.100854296428524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains, compared to Group Relative Policy Optimization (GRPO), in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially less training steps than GRPO. We demonstrate consistent improvements across a range of tasks and model sizes.
Related papers
- Soft Sequence Policy Optimization [0.0]
We introduce Soft Sequence Policy Optimization (SSPO) as an off-policy reinforcement learning objective.<n>SSPO incorporates soft gating functions over token-level probability ratios within sequence-level importance weights.<n>We show that SSPO improves training stability and performance in mathematical reasoning tasks.
arXiv Detail & Related papers (2026-02-22T20:21:00Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions.<n>It transforms the sparse terminal reward into dense, process-aware value estimates.<n>It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z) - On Symmetric Losses for Robust Policy Optimization with Noisy Preferences [55.8615920580824]
This work focuses on reward modeling, a core component in reinforcement learning from human feedback.<n>We propose a principled framework for robust policy optimization under noisy preferences.<n>We prove that symmetric losses enable successful policy optimization even under noisy labels.
arXiv Detail & Related papers (2025-05-30T15:30:43Z) - Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function.<n>$A$*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks.<n>It reduces training time by up to 2$times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - GVPO: Group Variance Policy Optimization for Large Language Model Post-Training [19.005045649097987]
Group Variance Policy Optimization (GVPO) incorporates the analytical solution to KL-constrained reward directly into its weights.<n>GVPO offers two key advantages: it guarantees a unique optimal solution, exactly the KL-constrained reward objective, and it supports flexible sampling distributions.<n>By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
arXiv Detail & Related papers (2025-04-28T09:02:24Z) - Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z) - Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [43.77763433288893]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.<n>We show this approach to generalize the direct alignment method IPO (identity preference optimization) and classic policy gradient.<n>We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation [46.61909578101735]
Adversarial Policy Optimization (AdvPO) is a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback.
In this paper, we introduce a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model.
arXiv Detail & Related papers (2024-03-08T09:20:12Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to both the PPO objective and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.