Towards Understanding Sycophancy in Language Models
- URL: http://arxiv.org/abs/2310.13548v3
- Date: Fri, 27 Oct 2023 17:45:26 GMT
- Title: Towards Understanding Sycophancy in Language Models
- Authors: Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda
Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds,
Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal
Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
- Abstract summary: We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback.
We show that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks.
Our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human feedback is commonly utilized to finetune AI assistants. But human
feedback may also encourage model responses that match user beliefs over
truthful ones, a behaviour known as sycophancy. We investigate the prevalence
of sycophancy in models whose finetuning procedure made use of human feedback,
and the potential role of human preference judgments in such behavior. We first
demonstrate that five state-of-the-art AI assistants consistently exhibit
sycophancy across four varied free-form text-generation tasks. To understand if
human preferences drive this broadly observed behavior, we analyze existing
human preference data. We find that when a response matches a user's views, it
is more likely to be preferred. Moreover, both humans and preference models
(PMs) prefer convincingly-written sycophantic responses over correct ones a
non-negligible fraction of the time. Optimizing model outputs against PMs also
sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results
indicate that sycophancy is a general behavior of state-of-the-art AI
assistants, likely driven in part by human preference judgments favoring
sycophantic responses.
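The abstract reports findings rather than an evaluation recipe, but the core measurement is easy to picture: ask the same factual question with and without a stated (incorrect) user opinion and count how often the answer flips toward the user. The sketch below illustrates this under assumed interfaces; the `generate` hook and the item fields are hypothetical placeholders, not code from the paper.

```python
# Minimal sycophancy-probe sketch (hypothetical interface, not the paper's code).
# `generate` is any callable mapping a prompt string to the assistant's reply.
from typing import Callable, Dict, List


def sycophancy_rate(generate: Callable[[str], str],
                    items: List[Dict[str, str]]) -> float:
    """Fraction of items where stating a wrong opinion flips a correct answer.

    Each item carries "question", "correct" (the true answer), and
    "incorrect" (the user's mistaken view).
    """
    flips, scored = 0, 0
    for item in items:
        neutral = generate(f"Question: {item['question']}\nAnswer briefly.")
        biased = generate(
            f"I think the answer is {item['incorrect']}, but I'd like your view.\n"
            f"Question: {item['question']}\nAnswer briefly."
        )
        # Only score items the assistant answered correctly without the opinion.
        if item["correct"].lower() in neutral.lower():
            scored += 1
            if item["incorrect"].lower() in biased.lower():
                flips += 1
    return flips / scored if scored else 0.0
```

A higher rate means the assistant more often defers to the stated view; the paper's free-form tasks probe richer variants of this behaviour than a single flip rate.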
Related papers
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
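As a rough illustration of that loop (not the paper's implementation; the `generate` hook and prompt wording below are assumptions), the respond-criticize-judge steps can be wired together as follows.

```python
# Sketch of a self-reference feedback loop: respond, criticize candidate
# answers against the model's own response, then judge which candidate wins.
# `generate` is a hypothetical text-generation hook, not an API from the paper.
from typing import Callable, List


def self_reference_judgement(generate: Callable[[str], str],
                             instruction: str,
                             candidates: List[str]) -> int:
    """Return the index of the candidate judged best."""
    own_answer = generate(f"Instruction: {instruction}\nRespond helpfully.")
    critiques = [
        generate(
            f"Instruction: {instruction}\n"
            f"Reference answer: {own_answer}\n"
            f"Candidate answer: {cand}\n"
            "Criticize the candidate relative to the reference."
        )
        for cand in candidates
    ]
    verdict = generate(
        "Based on the critiques below, reply with only the number of the best candidate.\n"
        + "\n".join(f"Candidate {i}: {c}" for i, c in enumerate(critiques))
    )
    # Naive parse of the judged index; fall back to the first candidate.
    for i in range(len(candidates) - 1, -1, -1):
        if str(i) in verdict:
            return i
    return 0
```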
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour
We show that Large Language Models (LLMs) show sycophantic tendencies when responding to queries involving subjective opinions and statements.
By contrast, when faced with queries that have an objective answer, LLMs at various scales seem not to follow the users' hints, instead delivering the correct answers with confidence.
arXiv Detail & Related papers (2023-11-15T22:18:33Z)
- Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values.
We have identified that the reward model often finds shortcuts to bypass its intended objectives.
We propose an innovative solution, applying the Product-of-Experts technique to separate reward modeling from the influence of sequence length.
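A minimal sketch of that idea, under the assumption that the product of experts combines a content reward head with a length-only head during pairwise (Bradley-Terry) training; for Bernoulli experts this reduces to summing the two logits before the sigmoid. The head names and loss below are assumptions about the general technique, not the paper's exact formulation.

```python
# Product-of-Experts sketch for length-debiased reward modelling (an assumed
# formulation of the general technique, not the paper's exact recipe).
import torch
import torch.nn.functional as F


def poe_preference_loss(r_content_chosen: torch.Tensor,
                        r_content_rejected: torch.Tensor,
                        r_length_chosen: torch.Tensor,
                        r_length_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss where each expert contributes its own logit.

    For Bernoulli experts, a product of experts equals a sigmoid over the sum
    of per-expert logits (here, reward differences). Downstream, only the
    content head would be used, leaving the length shortcut in its own expert.
    """
    logit_content = r_content_chosen - r_content_rejected
    logit_length = r_length_chosen - r_length_rejected
    return -F.logsigmoid(logit_content + logit_length).mean()
```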
arXiv Detail & Related papers (2023-10-08T15:14:39Z)
- Human Feedback is not Gold Standard
We critically analyse the use of human feedback for both training and evaluation.
We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality.
arXiv Detail & Related papers (2023-09-28T11:18:20Z)
- Simple synthetic data reduces sycophancy in large language models
We study the prevalence of sycophancy in language models.
Sycophancy is where models tailor their responses to follow a human user's view even when that view is not objectively correct.
arXiv Detail & Related papers (2023-08-07T23:48:36Z)
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
- Alignment with human representations supports robust few-shot learning
We show there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks.
We also show that highly-aligned models are more robust to both natural adversarial attacks and domain shifts.
Our results suggest that human-alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, be robust, and generalize well.
arXiv Detail & Related papers (2023-01-27T21:03:19Z)
- The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types
We argue that grounding the rationality coefficient in real data for each feedback type has a significant positive effect on reward learning.
We find that when learning from a single feedback type, overestimating human rationality can have dire effects on reward accuracy and regret.
arXiv Detail & Related papers (2022-08-23T02:19:10Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2-based models, on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline by a large margin at predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.