Generalisation of RLHF under Reward Shift and Clipped KL Regularisation
- URL: http://arxiv.org/abs/2602.21765v1
- Date: Wed, 25 Feb 2026 10:36:17 GMT
- Title: Generalisation of RLHF under Reward Shift and Clipped KL Regularisation
- Authors: Kenton Tang, Yuzhu Chen, Fengxiang He
- Abstract summary: We develop generalisation theory for reinforcement learning from human feedback (RLHF). We account for (1) reward shift: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) clipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error over prompts and rollouts, a reward shift error, and a KL clipping error. The theory yields practical implications for (1) the optimal KL clipping threshold, and (2) budget allocation across prompts, rollouts, and preference data.
- Score: 20.456598402422813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment and adaptation in large language models rely heavily on reinforcement learning from human feedback (RLHF); yet the theoretical understanding of its generalisability remains premature, especially when the learned reward can shift and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) reward shift: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) clipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, which introduces an additional error into RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error over prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss the special cases of (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, modelled as an Ornstein-Uhlenbeck process. The theory yields practical implications for (1) the optimal KL clipping threshold, and (2) budget allocation across prompts, rollouts, and preference data.
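The clipped KL mechanism described in the abstract admits a compact illustration. The sketch below is not the paper's implementation: the function names, the per-token Monte Carlo estimator, and the symmetric clipping threshold are assumptions chosen for clarity.

```python
import torch

def clipped_kl_estimate(logp_policy: torch.Tensor,
                        logp_ref: torch.Tensor,
                        clip_threshold: float) -> torch.Tensor:
    """Estimate KL(pi || pi_ref) from sampled log-probability ratios and
    clip the per-sample estimates for stabilisation (illustrative sketch)."""
    # Monte Carlo estimate on the policy's own rollouts:
    # log pi(y|x) - log pi_ref(y|x) for each sampled token.
    log_ratio = logp_policy - logp_ref
    # Clipping stabilises training but biases the estimate; this bias is
    # what the paper's bound tracks as the KL clipping error.
    clipped = torch.clamp(log_ratio, -clip_threshold, clip_threshold)
    return clipped.mean()

def kl_regularised_objective(rewards: torch.Tensor,
                             logp_policy: torch.Tensor,
                             logp_ref: torch.Tensor,
                             beta: float = 0.1,
                             clip_threshold: float = 10.0) -> torch.Tensor:
    """Toy RLHF objective: mean reward minus a clipped KL penalty."""
    return rewards.mean() - beta * clipped_kl_estimate(
        logp_policy, logp_ref, clip_threshold)
```

A larger threshold reduces the clipping error but weakens the stabilisation; the paper's bounds are what speak to trading these off when choosing the threshold.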
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences. We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices. We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z) - The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward [57.56453588632619]
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance. This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. We argue that standard RLVR objectives lack a crucial mechanism for knowledge retention.
arXiv Detail & Related papers (2025-09-09T06:34:32Z) - Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models. Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. We propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret.
arXiv Detail & Related papers (2025-02-11T11:11:05Z) - Sharp Analysis for KL-Regularized Contextual Bandits and RLHF [52.519416266840814]
Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique for enhancing policy optimization in reinforcement learning. We show that a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms.
arXiv Detail & Related papers (2024-11-07T11:22:46Z) - UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function [14.7365465149829]
We propose UNified Alignment (UNA), which unifies RLHF/PPO, DPO and KTO. With this novel mapping between a reward model and an optimal policy, UNA can outperform RLHF/PPO while simplifying, stabilizing, speeding up, and reducing the memory burden of the RL fine-tuning process.
arXiv Detail & Related papers (2024-08-27T18:04:07Z) - On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization [39.29350451006295]
Preference matching (PM) RLHF is a novel approach that aligns large language models with the preference distribution of the reward model under the Bradley-Terry-Luce/Plackett-Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses. Experiments show a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.
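As a rough illustration only, the sketch below takes the summary at face value: the regularizer is the negative log-probability the policy assigns to its sampled responses, added to the reward objective. The function names and weighting are hypothetical, and the paper's actual PM objective may contain further terms.

```python
import torch

def pm_style_regularizer(logp_responses: torch.Tensor) -> torch.Tensor:
    """Negative log-probability of the sampled responses under the policy
    (the form the summary describes); maximising it discourages the policy
    from collapsing onto a single preferred response."""
    return -logp_responses.mean()

def pm_rlhf_objective(rewards: torch.Tensor,
                      logp_responses: torch.Tensor,
                      lam: float = 1.0) -> torch.Tensor:
    # Hypothetical combination: reward plus the entropy-like PM term,
    # with an illustrative weight lam.
    return rewards.mean() + lam * pm_style_regularizer(logp_responses)
```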
arXiv Detail & Related papers (2024-05-26T07:00:05Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
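For concreteness, here is a minimal sketch of the DPO objective: a logistic loss on the difference of implicit rewards beta * log(pi / pi_ref) between the preferred and dispreferred responses. The function and argument names are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs (sketch)."""
    # Implicit rewards under the closed-form reward parameterization.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin) == softplus(-margin), averaged over the batch.
    return F.softplus(-(chosen_reward - rejected_reward)).mean()
```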
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - RRHF: Rank Responses to Align Language Models with Human Feedback without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
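A rough sketch of the scoring and ranking idea the summary describes: each sampled response is scored by its mean token log-probability under the policy, and responses that the reward model ranks lower are pushed below the ones it ranks higher. This is a simplification; RRHF's full loss (for example, any additional fine-tuning term on the best response) may differ.

```python
import torch

def response_score(token_logprobs: torch.Tensor) -> torch.Tensor:
    """Length-normalised log conditional probability of one response."""
    return token_logprobs.mean()

def ranking_loss(scores: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Hinge-style ranking loss: penalise whenever a response with a lower
    reward receives a higher policy score (illustrative sketch)."""
    loss = scores.new_zeros(())
    n = scores.shape[0]
    for i in range(n):
        for j in range(n):
            if rewards[i] < rewards[j]:
                loss = loss + torch.relu(scores[i] - scores[j])
    return loss
```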
arXiv Detail & Related papers (2023-04-11T15:53:40Z) - RL with KL penalties is better viewed as Bayesian inference [4.473139775790299]
We analyze challenges associated with treating a language model as a reinforcement learning policy.
We show how avoiding those challenges requires moving beyond the RL paradigm.
arXiv Detail & Related papers (2022-05-23T12:47:13Z) - Leverage the Average: an Analysis of KL Regularization in RL [44.01222241795292]
We show that Kullback-Leibler (KL) regularization implicitly averages q-values.
We provide a very strong performance bound, the very first to combine two desirable aspects.
Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study.
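The "implicit averaging" claim has a simple closed-form illustration in the tabular, single-state case. The sketch below, which assumes a uniform initial policy and exact q-values, shows that iterating KL-regularised improvement steps produces a softmax of the sum of past q-values rather than of the latest one; the paper's analysis covers the general approximate setting.

```python
import numpy as np

def kl_regularised_step(prev_policy: np.ndarray, q: np.ndarray, eta: float) -> np.ndarray:
    """One KL-regularised improvement step for a single state:
    argmax_pi <pi, q> - (1/eta) * KL(pi || prev_policy),
    whose closed form is prev_policy * exp(eta * q), renormalised."""
    logits = np.log(prev_policy) + eta * q
    p = np.exp(logits - logits.max())
    return p / p.sum()

rng = np.random.default_rng(0)
pi = np.ones(4) / 4                      # uniform initial policy
qs = [rng.normal(size=4) for _ in range(5)]
for q in qs:
    pi = kl_regularised_step(pi, q, eta=1.0)

# The resulting policy equals softmax(eta * sum of the q-values seen so far).
target = np.exp(sum(qs))
print(np.allclose(pi, target / target.sum()))  # True
```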
arXiv Detail & Related papers (2020-03-31T10:55:06Z)