Related papers: Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models

Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models

URL: http://arxiv.org/abs/2601.15436v2
Date: Mon, 26 Jan 2026 16:45:31 GMT
Title: Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models
Authors: Shahar Ben Natan, Oren Tsur,
Abstract summary: We propose a novel way to evaluate sycophancy of LLMs in a direct and neutral way.<n>A key novelty is the use of LLM-as-a-judge, evaluation of sycophancy as a zero-sum game in a bet setting.
Score: 2.1700203922407493
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We propose a novel way to evaluate sycophancy of LLMs in a direct and neutral way, mitigating various forms of uncontrolled bias, noise, or manipulative language, deliberately injected to prompts in prior works. A key novelty in our approach is the use of LLM-as-a-judge, evaluation of sycophancy as a zero-sum game in a bet setting. Under this framework, sycophancy serves one individual (the user) while explicitly incurring cost on another. Comparing four leading models - Gemini 2.5 Pro, ChatGpt 4o, Mistral-Large-Instruct-2411, and Claude Sonnet 3.7 - we find that while all models exhibit sycophantic tendencies in the common setting, in which sycophancy is self-serving to the user and incurs no cost on others, Claude and Mistral exhibit "moral remorse" and over-compensate for their sycophancy in case it explicitly harms a third party. Additionally, we observed that all models are biased toward the answer proposed last. Crucially, we find that these two phenomena are not independent; sycophancy and recency bias interact to produce `constructive interference' effect, where the tendency to agree with the user is exacerbated when the user's opinion is presented last.

Related papers

Ask don't tell: Reducing sycophancy in large language models [1.5701458173528275]
We show that sycophancy is substantially higher in response to non-questions compared to questions.<n>We find that asking a model to convert non-questions into questions before answering significantly reduces sycophancy.
arXiv Detail & Related papers (2026-02-27T12:27:04Z)
Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians [47.64440749179653]
We show that even an idealized Bayes-rational user is vulnerable to delusional spiraling.<n>This effect persists in the face of two candidate mitigations.
arXiv Detail & Related papers (2026-02-22T12:13:44Z)
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models [11.001042171551566]
We study how user opinions induce sycophancy across different model families.<n>First-person prompts consistently induce higher sycophancy rates than third-person framings.<n>These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers.
arXiv Detail & Related papers (2025-08-04T05:55:06Z)
One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models.<n>We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z)
Measuring Sycophancy of Language Models in Multi-turn Dialogues [33.875038658886986]
We introduce SYCON Bench, a novel benchmark for evaluating sycophancy in multi-turn, free-form conversational settings.<n>Applying SYCON Bench to 17 Large Language Models across three real-world scenarios, we find that sycophancy remains a prevalent failure mode.
arXiv Detail & Related papers (2025-05-28T14:05:46Z)
ELEPHANT: Measuring and understanding social sycophancy in LLMs [31.88430788417527]
We introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face.<n>Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy.
arXiv Detail & Related papers (2025-05-20T06:45:17Z)
Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases [77.3489598315447]
We argue that identifying the boundary between fact and fair is essential for meaningful fairness evaluation.<n>We introduce Fact-or-Fair, a benchmark with (i) objective queries aligned with descriptive, fact-based judgments, and (ii) subjective queries aligned with normative, fairness-based judgments.
arXiv Detail & Related papers (2025-02-09T10:54:11Z)
Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping. We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups. We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z)
KTO: Model Alignment as Prospect Theoretic Optimization [67.44320255397506]
Kahneman & Tversky's $textitprospect theory$ tells us that humans perceive random variables in a biased but well-defined manner. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases. We propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences.
arXiv Detail & Related papers (2024-02-02T10:53:36Z)
Towards Understanding Sycophancy in Language Models [49.352840825419236]
We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback.<n>We show that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks.<n>Our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
arXiv Detail & Related papers (2023-10-20T14:46:48Z)
Simple synthetic data reduces sycophancy in large language models [88.4435858554904]
We study the prevalence of sycophancy in language models. Sycophancy is where models tailor their responses to follow a human user's view even when that view is not objectively correct.
arXiv Detail & Related papers (2023-08-07T23:48:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.