What is the Alignment Objective of GRPO?
- URL: http://arxiv.org/abs/2502.18548v3
- Date: Thu, 13 Mar 2025 16:48:34 GMT
- Title: What is the Alignment Objective of GRPO?
- Authors: Milan Vojnovic, Se-Young Yun
- Abstract summary: We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function. We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size.
- Score: 30.36318490634376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this note, we examine the aggregation of preferences achieved by the Group Relative Policy Optimisation (GRPO) algorithm, a reinforcement learning method used to train advanced artificial intelligence models such as DeepSeek-R1-Zero and DeepSeekMath. The GRPO algorithm trains a policy using a reward preference model, which is computed by sampling a set of outputs for a given context, observing the corresponding rewards, and applying shift-and-scale normalisation to these reward values. Additionally, it incorporates a penalty function to discourage deviations from a reference policy. We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. This analysis reveals that the aggregation of preferences differs fundamentally from standard logarithmic pooling, which is implemented by other approaches such as RLHF. The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function, which we show to essentially correspond to the reverse Kullback-Leibler (KL) divergence between the aggregation policy and the reference policy. Interestingly, we demonstrate that for groups of size two, the reward preference model corresponds to pairwise comparison preferences, similar to those in other alignment methods based on pairwise comparison feedback. We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size. This provides insights into the dependence of the aggregate preference on parameters such as the regularisation constant and the confidence margin of question answers. Finally, we discuss the aggregation of preferences obtained by modifying the GRPO algorithm to use direct KL divergence as the penalty or to use rewards without scale normalisation.
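For context, the "logarithmic pooling" implemented by KL-regularised RLHF has the well-known stationary form $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\, \exp\big(r(x, y)/\beta\big)$, where $\beta$ is the regularisation constant; the paper shows that GRPO's stationary policies aggregate preferences in a fundamentally different way. To make the mechanics concrete, below is a minimal, hedged sketch (not the authors' code) of the group-based reward preference model described in the abstract: for a given context, a group of outputs is sampled, their rewards are shift-and-scale normalised into advantages, and a penalty term discourages deviation from the reference policy. The group size, the binary toy rewards, the value of `beta`, and the per-sample estimator used for the penalty are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch of GRPO's group-based reward normalisation and penalty
# (illustrative only; not the authors' implementation). Assumptions: binary
# rewards, a single context, group size 8, and the non-negative per-sample
# estimator pi_ref/pi - log(pi_ref/pi) - 1 for the KL-style penalty.

import numpy as np


def group_normalised_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Shift-and-scale normalisation: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def grpo_style_objective(log_probs: np.ndarray,
                         ref_log_probs: np.ndarray,
                         rewards: np.ndarray,
                         beta: float = 0.1) -> float:
    """Per-group surrogate: mean_i A_i * log pi(o_i | q) - beta * penalty.

    The penalty uses the per-sample estimator pi_ref/pi - log(pi_ref/pi) - 1,
    which the abstract argues behaves essentially like a reverse KL divergence
    between the trained policy and the reference policy.
    """
    advantages = group_normalised_advantages(rewards)
    ratio = np.exp(ref_log_probs - log_probs)        # pi_ref(o|q) / pi(o|q) per output
    penalty = np.mean(ratio - np.log(ratio) - 1.0)   # >= 0, zero iff pi == pi_ref
    return float(np.mean(advantages * log_probs) - beta * penalty)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    group_size = 8                                                   # outputs sampled per context
    rewards = rng.binomial(1, 0.6, size=group_size).astype(float)    # e.g. binary correctness rewards
    log_probs = np.log(rng.uniform(0.05, 0.5, size=group_size))      # log pi(o_i | q), current policy
    ref_log_probs = np.log(rng.uniform(0.05, 0.5, size=group_size))  # log pi_ref(o_i | q)
    print("advantages:", group_normalised_advantages(rewards))
    print("objective :", grpo_style_objective(log_probs, ref_log_probs, rewards))
```

Dropping the division by the group standard deviation recovers the "rewards without scale normalisation" variant discussed at the end of the abstract, and replacing the per-sample estimator with a direct KL term corresponds to the other modification considered there.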
Related papers
- Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment [30.266966684932186]
We propose a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance. Our method outperforms KL- and $f$-divergence-based baselines.
arXiv Detail & Related papers (2026-02-02T05:56:16Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
Group Relative Policy Optimization is often applied in the multi-reward setting without examining its suitability. We introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model [43.74350307533018]
We study policy alignment to preferences of unknown and unrestricted complexity. We use first-order optimization suited to neural networks and batched data.
arXiv Detail & Related papers (2025-12-26T08:22:41Z) - Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment [61.80228667422234]
VGPO redefines value estimation across both temporal and group dimensions. It transforms the sparse terminal reward into dense, process-aware value estimates. It replaces standard group normalization with a novel process enhanced by absolute values to maintain a stable optimization signal.
arXiv Detail & Related papers (2025-12-13T16:31:26Z) - Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models [48.3520220561093]
Group Relative Policy Optimization has shown promise in aligning image and video generative models with human preferences. Applying it to modern flow matching models is challenging because of their deterministic sampling paradigm. We propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs.
arXiv Detail & Related papers (2025-11-21T05:02:47Z) - GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning [52.16150076582931]
We propose Group Relative Policy Optimization for Representation Model (GRPO-RM). Our method establishes a predefined output set to functionally replace token sequence sampling in large language models (LLMs). A specialized reward function is designed to accommodate the properties of representation models.
arXiv Detail & Related papers (2025-11-19T09:19:39Z) - On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence [2.8165669455824696]
Group Relative Policy Optimization is a critic-free reinforcement learning algorithm. We show that the GRPO update rule estimates the policy gradient at the old policy rather than the current one. We propose a new algorithm: Trajectory-level Importance Corrected GRPO.
arXiv Detail & Related papers (2025-08-04T19:01:19Z) - Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning [11.708197376569016]
Group Relative Policy Optimization (GRPO) computes the advantage for each output by subtracting the mean reward of all outputs in the group as a baseline. This can lead to inaccurate advantage estimates in environments with highly noisy rewards, potentially introducing bias. We propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), which uses lightweight Kalman filtering to dynamically estimate the latent reward mean and variance.
arXiv Detail & Related papers (2025-05-12T13:09:49Z) - Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy. We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems.
Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process.
This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z) - SePPO: Semi-Policy Preference Optimization for Diffusion Alignment [67.8738082040299]
We propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data.
We validate SePPO across both text-to-image and text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z) - Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator.
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
arXiv Detail & Related papers (2024-09-25T22:20:11Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - ROPO: Robust Preference Optimization for Large Language Models [59.10763211091664]
We propose an iterative alignment approach that integrates noise-tolerance and filtering of noisy samples without the aid of external models.
Experiments on three widely-used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods.
arXiv Detail & Related papers (2024-04-05T13:58:51Z) - BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback [30.894025833141537]
High variance of the gradient estimate is the primary reason for the lack of success of these methods.
We generalize the target distribution in DPG, GDC and DPO by using Bayes' rule to define the reward-conditioned posterior.
The resulting approach, referred to as BRAIn, significantly outperforms prior art in summarization and Anthropic HH tasks.
arXiv Detail & Related papers (2024-02-04T13:16:29Z) - PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z)