Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models
- URL: http://arxiv.org/abs/2601.15436v2
- Date: Mon, 26 Jan 2026 16:45:31 GMT
- Title: Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models
- Authors: Shahar Ben Natan, Oren Tsur,
- Abstract summary: We propose a novel way to evaluate sycophancy of LLMs in a direct and neutral way.<n>A key novelty is the use of LLM-as-a-judge, evaluation of sycophancy as a zero-sum game in a bet setting.
- Score: 2.1700203922407493
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose a novel way to evaluate sycophancy of LLMs in a direct and neutral way, mitigating various forms of uncontrolled bias, noise, or manipulative language, deliberately injected to prompts in prior works. A key novelty in our approach is the use of LLM-as-a-judge, evaluation of sycophancy as a zero-sum game in a bet setting. Under this framework, sycophancy serves one individual (the user) while explicitly incurring cost on another. Comparing four leading models - Gemini 2.5 Pro, ChatGpt 4o, Mistral-Large-Instruct-2411, and Claude Sonnet 3.7 - we find that while all models exhibit sycophantic tendencies in the common setting, in which sycophancy is self-serving to the user and incurs no cost on others, Claude and Mistral exhibit "moral remorse" and over-compensate for their sycophancy in case it explicitly harms a third party. Additionally, we observed that all models are biased toward the answer proposed last. Crucially, we find that these two phenomena are not independent; sycophancy and recency bias interact to produce `constructive interference' effect, where the tendency to agree with the user is exacerbated when the user's opinion is presented last.
Related papers
- Ask don't tell: Reducing sycophancy in large language models [1.5701458173528275]
We show that sycophancy is substantially higher in response to non-questions compared to questions.<n>We find that asking a model to convert non-questions into questions before answering significantly reduces sycophancy.
arXiv Detail & Related papers (2026-02-27T12:27:04Z) - Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians [47.64440749179653]
We show that even an idealized Bayes-rational user is vulnerable to delusional spiraling.<n>This effect persists in the face of two candidate mitigations.
arXiv Detail & Related papers (2026-02-22T12:13:44Z) - When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models [11.001042171551566]
We study how user opinions induce sycophancy across different model families.<n>First-person prompts consistently induce higher sycophancy rates than third-person framings.<n>These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers.
arXiv Detail & Related papers (2025-08-04T05:55:06Z) - One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models.<n>We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z) - Measuring Sycophancy of Language Models in Multi-turn Dialogues [33.875038658886986]
We introduce SYCON Bench, a novel benchmark for evaluating sycophancy in multi-turn, free-form conversational settings.<n>Applying SYCON Bench to 17 Large Language Models across three real-world scenarios, we find that sycophancy remains a prevalent failure mode.
arXiv Detail & Related papers (2025-05-28T14:05:46Z) - ELEPHANT: Measuring and understanding social sycophancy in LLMs [31.88430788417527]
We introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face.<n>Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy.
arXiv Detail & Related papers (2025-05-20T06:45:17Z) - Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases [77.3489598315447]
We argue that identifying the boundary between fact and fair is essential for meaningful fairness evaluation.<n>We introduce Fact-or-Fair, a benchmark with (i) objective queries aligned with descriptive, fact-based judgments, and (ii) subjective queries aligned with normative, fairness-based judgments.
arXiv Detail & Related papers (2025-02-09T10:54:11Z) - Self-Debiasing Large Language Models: Zero-Shot Recognition and
Reduction of Stereotypes [73.12947922129261]
We leverage the zero-shot capabilities of large language models to reduce stereotyping.
We show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups.
We hope this work opens inquiry into other zero-shot techniques for bias mitigation.
arXiv Detail & Related papers (2024-02-03T01:40:11Z) - KTO: Model Alignment as Prospect Theoretic Optimization [67.44320255397506]
Kahneman & Tversky's $textitprospect theory$ tells us that humans perceive random variables in a biased but well-defined manner.
We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases.
We propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences.
arXiv Detail & Related papers (2024-02-02T10:53:36Z) - Towards Understanding Sycophancy in Language Models [49.352840825419236]
We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback.<n>We show that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks.<n>Our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
arXiv Detail & Related papers (2023-10-20T14:46:48Z) - Simple synthetic data reduces sycophancy in large language models [88.4435858554904]
We study the prevalence of sycophancy in language models.
Sycophancy is where models tailor their responses to follow a human user's view even when that view is not objectively correct.
arXiv Detail & Related papers (2023-08-07T23:48:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.