f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
- URL: http://arxiv.org/abs/2602.05946v2
- Date: Mon, 09 Feb 2026 18:34:18 GMT
- Title: f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
- Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song
- Abstract summary: We propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning objectives, and f-Hybrid Alignment Loss (f-HAL). We provide theoretical guarantees that these classes of objectives improve the average reward after alignment.
- Score: 15.396104072574104
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning objectives, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
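As a rough illustration of the "variational representation of f-divergences" mentioned in the abstract, the sketch below checks the standard identity D_f(P || Q) = sup_T { E_P[T] - E_Q[f*(T)] } for the KL case, where f(t) = t log t and the convex conjugate is f*(u) = exp(u - 1). It is not the paper's f-GRPO or f-HAL objective. The toy distributions p and q, the critic values, and all function names are assumptions made for illustration only.

```python
import math

def kl_divergence(p, q):
    """Exact KL(P || Q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def variational_bound(p, q, critic):
    """E_P[T] - E_Q[f*(T)] for f(t) = t log t, where f*(u) = exp(u - 1)."""
    expect_p = sum(pi * ti for pi, ti in zip(p, critic))
    expect_q = sum(qi * math.exp(ti - 1.0) for qi, ti in zip(q, critic))
    return expect_p - expect_q

# Hypothetical toy distributions (assumed for illustration, not from the paper).
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

# The supremum is attained at T*(x) = f'(p(x) / q(x)) = 1 + log(p(x) / q(x)).
optimal_critic = [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]
# Any other critic gives a lower bound on the divergence.
other_critic = [0.0, 0.0, 0.0]

exact = kl_divergence(p, q)
tight = variational_bound(p, q, optimal_critic)
loose = variational_bound(p, q, other_critic)
```

With the optimal critic the bound matches the exact divergence, while a suboptimal critic only bounds it from below. Maximizing such a bound over a parameterized critic is the usual way a divergence is estimated from samples.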
Related papers
- From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning [7.6602542594279335]
We propose Reinforcement Learning with Relative Rewards (RLRR) to shift reward shaping from absolute scoring to relative ranking. We show that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks.
arXiv Detail & Related papers (2026-01-30T15:07:06Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under a multi-reward setting without examining its suitability. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z)
- On the Hidden Objective Biases of Group-based Reinforcement Learning [40.50223796302703]
Group-based reinforcement learning methods are now widely used to post-train large language models. We present a theoretical analysis of GRPO-style methods by studying them within a unified surrogate formulation.
arXiv Detail & Related papers (2026-01-08T15:00:35Z)
- A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs [2.840505903487544]
This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning environments. We introduce a comprehensive evaluation framework that assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences.
arXiv Detail & Related papers (2025-12-09T16:39:32Z)
- The Peril of Preference: Why GRPO fails on Ordinal Rewards [0.8937905773981699]
We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that solves this flaw. CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring failed solutions are never positively reinforced. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization.
arXiv Detail & Related papers (2025-11-06T15:12:50Z)
- DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data [65.09939942413651]
We propose a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value.
arXiv Detail & Related papers (2025-05-21T03:43:29Z)
- GVPO: Group Variance Policy Optimization for Large Language Model Post-Training [19.005045649097987]
Group Variance Policy Optimization (GVPO) incorporates the analytical solution to KL-constrained reward directly into its weights. GVPO offers two key advantages: it guarantees a unique optimal solution, exactly the KL-constrained reward objective, and it supports flexible sampling distributions. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
arXiv Detail & Related papers (2025-04-28T09:02:24Z)
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
arXiv Detail & Related papers (2025-04-18T17:49:55Z)
- Accelerating Proximal Policy Optimization Learning Using Task Prediction for Solving Environments with Delayed Rewards [8.455772877963792]
We introduce two key enhancements to PPO: a hybrid policy architecture that combines an offline policy with an online PPO policy, and a reward shaping mechanism using Time Window Temporal Logic (TWTL). We demonstrate the effectiveness of our approach through extensive experiments on inverted pendulum and lunar lander environments.
arXiv Detail & Related papers (2024-11-26T20:22:31Z)
- PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
- Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
- Target-Embedding Autoencoders for Supervised Representation Learning [111.07204912245841]
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional.
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features as well as predictive of targets.
arXiv Detail & Related papers (2020-01-23T02:37:10Z)