How RLHF Amplifies Sycophancy
- URL: http://arxiv.org/abs/2602.01002v1
- Date: Sun, 01 Feb 2026 03:46:14 GMT
- Title: How RLHF Amplifies Sycophancy
- Authors: Itai Shapira, Gerdus Benade, Ariel D. Procaccia
- Abstract summary: Large language models often exhibit increased sycophantic behavior after preference-based post-training. We identify an explicit amplification mechanism that causally links optimization against a learned reward to bias in the human preference data used for alignment. We propose a training-time intervention designed to neutralize the amplification mechanism itself.
- Score: 23.213056717401418
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models often exhibit increased sycophantic behavior after preference-based post-training, showing a stronger tendency to affirm a user's stated or implied belief even when this conflicts with factual accuracy or sound judgment. We present a formal analysis of how alignment from human feedback can increase this failure mode by identifying an explicit amplification mechanism that causally links optimization against a learned reward to bias in the human preference data used for alignment. We show that the direction of behavioral drift is determined by a covariance under the base policy between endorsing the belief signal in the prompt and the learned reward, and that the first-order effect reduces to a simple mean-gap condition. We then analyze reward learning from pairwise comparisons under random utility models like Bradley-Terry and characterize when bias in human annotators' preferences induces this reward gap. Next, we propose a training-time intervention designed to neutralize the amplification mechanism itself. Among all post-trained policies that prevent sycophantic behavior from increasing, we characterize the unique policy closest in KL divergence to the unconstrained post-trained policy, and derive the corresponding minimal reward correction as a closed-form agreement penalty. Computational experiments find that reward gaps are common and cause behavioral drift in all the configurations considered.
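To make the abstract's first-order condition concrete, below is a minimal sketch in Python. It assumes we can score base-policy samples with the learned reward and label each one as endorsing the user's stated belief or not; the array names, helper functions, and the specific form of the penalty are our assumptions, not the paper's code. The sign of the covariance between the agreement indicator and the learned reward gives the direction of behavioral drift, this reduces to a gap between conditional mean rewards, and subtracting that gap from agreeing responses is one illustrative way to zero out the amplification term.

```python
import numpy as np

# Illustrative sketch, not the paper's code. We assume access to responses
# sampled from the base policy for a prompt that states a user belief, with:
#   agree[i]  -- 1 if response i endorses the stated belief, else 0
#   reward[i] -- the learned reward model's score for response i
# These names (and the penalty form below) are ours, not the paper's.

def sycophancy_drift_diagnostics(agree, reward):
    """First-order check of whether KL-regularized optimization of the
    learned reward will push the policy toward endorsing the belief."""
    agree = np.asarray(agree, dtype=float)
    reward = np.asarray(reward, dtype=float)

    # Covariance, under the base policy, between "endorse the belief" and
    # the learned reward: its sign gives the direction of first-order drift.
    drift = np.cov(agree, reward, bias=True)[0, 1]

    # Equivalent mean-gap condition for a binary agreement indicator:
    # the covariance is positive exactly when agreeing responses receive a
    # higher mean reward than non-agreeing ones.
    mean_gap = reward[agree == 1].mean() - reward[agree == 0].mean()
    return drift, mean_gap

def agreement_penalized_reward(agree, reward):
    """One illustrative 'agreement penalty': subtract the mean reward gap
    from agreeing responses so that the covariance above vanishes. The
    paper's closed-form correction may differ; this is our reading of it."""
    agree = np.asarray(agree, dtype=float)
    reward = np.asarray(reward, dtype=float)
    lam = reward[agree == 1].mean() - reward[agree == 0].mean()
    return reward - lam * agree, lam
```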
Related papers
- Targeting Misalignment: A Conflict-Aware Framework for Reward-Model-based LLM Alignment [5.900494456937422]
Reward-model-based fine-tuning is a central paradigm in aligning Large Language Models with human preferences. This paper investigates a novel framework to identify and mitigate such misalignment by treating the fine-tuning process as a form of knowledge integration.
arXiv Detail & Related papers (2025-12-10T00:52:21Z)
- Rectifying Shortcut Behaviors in Preference-based Reward Learning [46.09046818725698]
In reinforcement learning, preference-based reward models play a central role in aligning large language models to human-aligned behavior. Recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. We introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning.
arXiv Detail & Related papers (2025-10-21T20:08:32Z)
- A Principled Loss Function for Direct Language Model Alignment [0.0]
We propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific finite value for the logits, which is dictated by the underlying reward, rather than its difference. This inherent stability prevents reward hacking and leads to more effective alignment.
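As a rough illustration of what "targeting a specific finite value for the logits" could look like: the RLHF optimality condition ties beta times the log-ratio log pi / pi_ref to the underlying reward, so one can regress that implicit reward toward a finite target instead of optimizing only a chosen-minus-rejected difference. The squared-error form, the names, and the omission of the log-partition term below are our assumptions, not the paper's exact loss.

```python
import numpy as np

def value_targeted_loss(logp, ref_logp, target_reward, beta=0.1):
    """Hypothetical sketch: the RLHF optimality condition gives
    beta * (log pi - log pi_ref) = r - beta * log Z, so instead of a
    pairwise difference (as in DPO) one can regress this implicit reward
    toward a finite target tied to the underlying reward. The squared-error
    form and the dropped log-partition term are our simplifications."""
    implicit_reward = beta * (np.asarray(logp) - np.asarray(ref_logp))
    return (implicit_reward - target_reward) ** 2
```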
arXiv Detail & Related papers (2025-08-10T01:56:58Z)
- Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization [60.176008034221404]
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Prior work has observed that the likelihood of preferred responses often decreases during training. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning.
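A minimal numerical sketch of why this can happen: the standard DPO loss depends only on the difference of the two log-ratios, so an update can lower the preferred response's likelihood and still reduce the loss as long as the dispreferred response's likelihood falls faster. The toy numbers below are ours, purely for illustration.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair: it constrains only the
    difference of log-ratios, not either likelihood on its own."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid

# Toy numbers (ours): the reference model assigns log-prob -10 to both
# the preferred (y_w) and dispreferred (y_l) response.
before = dpo_loss(-10.0, -10.0, ref_logp_w=-10.0, ref_logp_l=-10.0)

# After an update that LOWERS the preferred response's log-prob
# (-10 -> -14) but lowers the dispreferred one even more (-10 -> -20),
# the loss still improves, because only the margin matters.
after = dpo_loss(-14.0, -20.0, ref_logp_w=-10.0, ref_logp_l=-10.0)

print(before, after)  # the loss decreases even though p(y_w) dropped
```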
arXiv Detail & Related papers (2024-10-11T14:22:44Z)
- Sequential Manipulation Against Rank Aggregation: Theory and Algorithm [119.57122943187086]
We leverage an online attack on the vulnerable data collection process.
From the game-theoretic perspective, the confrontation scenario is formulated as a distributionally robust game.
The proposed method manipulates the results of rank aggregation methods in a sequential manner.
arXiv Detail & Related papers (2024-07-02T03:31:21Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
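A minimal sketch of such a combined objective, assuming a DPO-style preference term plus a negative log-likelihood (SFT) term on the preferred response; the weight `lambda_sft` and the specific preference loss are our assumptions rather than the paper's exact formulation.

```python
import numpy as np

def combined_objective(logp_w, logp_l, ref_logp_w, ref_logp_l,
                       beta=0.1, lambda_sft=1.0):
    """Sketch of a preference-optimization loss plus a supervised (SFT)
    negative log-likelihood term on the preferred response. The DPO-style
    preference term and the weight lambda_sft are our assumptions, not the
    paper's exact objective."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    pref_loss = -np.log(1.0 / (1.0 + np.exp(-beta * margin)))
    sft_loss = -logp_w  # keeps the preferred response likely under the policy
    return pref_loss + lambda_sft * sft_loss
```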
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- STRAPPER: Preference-based Reinforcement Learning via Self-training Augmentation and Peer Regularization [18.811470043767713]
Preference-based reinforcement learning (PbRL) promises to learn a complex reward function with binary human preference.
We present a self-training method along with our proposed peer regularization, which penalizes the reward model for memorizing uninformative labels and helps it acquire confident predictions.
arXiv Detail & Related papers (2023-07-19T00:31:58Z)
- Ensembling over Classifiers: a Bias-Variance Perspective [13.006468721874372]
We build upon the extension to the bias-variance decomposition by Pfau (2013) in order to gain crucial insights into the behavior of ensembles of classifiers.
We show that conditional estimates necessarily incur an irreducible error.
Empirically, standard ensembling reduces the bias, leading us to hypothesize that ensembles of classifiers may perform well in part because of this unexpected reduction.
arXiv Detail & Related papers (2022-06-21T17:46:35Z)
- Benign Overfitting in Adversarially Robust Linear Classification [91.42259226639837]
"Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community.
We show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples.
arXiv Detail & Related papers (2021-12-31T00:27:31Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
- Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.