When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
- URL: http://arxiv.org/abs/2508.02087v2
- Date: Tue, 05 Aug 2025 04:26:47 GMT
- Title: When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
- Authors: Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, Di Wang
- Abstract summary: We study how user opinions induce sycophancy across different model families. First-person prompts consistently induce higher sycophancy rates than third-person framings. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers.
- Score: 11.001042171551566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (``I believe...'') consistently induce higher sycophancy rates than third-person framings (``They believe...'') by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
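For readers who want a concrete picture of the two interpretability tools named in the abstract, the sketches below are minimal, illustrative examples rather than the authors' released code: the model name ("gpt2"), prompts, layer index, and GPT-2-specific module paths (transformer.ln_f, transformer.h, lm_head) are assumptions standing in for whatever models the paper actually evaluates.

```python
# Logit-lens sketch: project each layer's hidden state at the final prompt position
# through the unembedding matrix and watch where the preferred next token shifts.
# Model name, prompt, and the GPT-2-specific final layer norm are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates several model families
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# First-person opinion framing of the kind studied in the paper (illustrative prompt).
prompt = "I believe the capital of Australia is Sydney. The capital of Australia is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.hidden_states holds num_layers + 1 tensors of shape [batch, seq_len, hidden].
for layer_idx, h in enumerate(out.hidden_states):
    last = h[0, -1]                      # hidden state at the final prompt position
    last = model.transformer.ln_f(last)  # final layer norm (GPT-2-specific module path)
    logits = model.lm_head(last)         # project through the unembedding matrix
    top_token = tok.decode([int(logits.argmax())])
    print(f"layer {layer_idx:2d}: preferred next token = {top_token!r}")
```

Causal activation patching asks the complementary causal question: if a layer's activation from the opinion-framed run is written into a neutral run, does the output preference follow it? A sketch under the same assumptions:

```python
# Causal activation-patching sketch: cache one block's activation from the
# opinion-framed run and write it into a neutral run, then check whether the
# patched run's output preference follows the opinionated activation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder, as above
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

opinion = "I believe the capital of Australia is Sydney. The capital of Australia is"
neutral = "The capital of Australia is"
layer_to_patch = 8  # illustrative layer index

block = model.transformer.h[layer_to_patch]
cache = {}

def save_hook(module, args, output):
    # GPT-2 blocks return a tuple; element 0 is the hidden-state tensor.
    cache["h"] = output[0].detach()

handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**tok(opinion, return_tensors="pt"))
handle.remove()

def patch_hook(module, args, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = cache["h"][:, -1, :]  # overwrite the final-position activation
    return (hidden,) + output[1:]            # returned value replaces the block output

handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**tok(neutral, return_tensors="pt"))
handle.remove()

print("patched top token:", tok.decode([int(patched.logits[0, -1].argmax())]))
```

Sweeping the patched layer (and optionally the token position) is the usual way such analyses localize where behavior-relevant information enters the residual stream.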
Related papers
- Ask don't tell: Reducing sycophancy in large language models [1.5701458173528275]
We show that sycophancy is substantially higher in response to non-questions compared to questions. We find that asking a model to convert non-questions into questions before answering significantly reduces sycophancy.
arXiv Detail & Related papers (2026-02-27T12:27:04Z) - Disentangling Deception and Hallucination Failures in LLMs [7.906722750233381]
We propose an internal, mechanism-oriented perspective that separates Knowledge Existence from Behavior Expression. Hallucination and deception correspond to two qualitatively different failure modes that may appear similar at the output level but differ in their underlying mechanisms. We analyze these failure modes through representation separability, sparse interpretability, and inference-time activation steering.
arXiv Detail & Related papers (2026-02-16T07:36:49Z) - Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models [2.1700203922407493]
We propose a novel way to evaluate the sycophancy of LLMs directly and neutrally. A key novelty is the use of LLM-as-a-judge and the evaluation of sycophancy as a zero-sum game in a betting setting.
arXiv Detail & Related papers (2026-01-21T20:00:14Z) - Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations [70.43616821802249]
Large language models (LLMs) frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness. This paper shows that truthfulness cues arise from two distinct information pathways.
arXiv Detail & Related papers (2026-01-12T11:10:43Z) - Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models [4.946483489399819]
Large Language Models (LLMs) are prone to hallucination, the generation of factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions.
arXiv Detail & Related papers (2025-10-07T16:40:31Z) - Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models [57.834711966432685]
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. We introduce the Bullshit Index, a novel metric quantifying large language models' indifference to truth. We observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy.
arXiv Detail & Related papers (2025-07-10T07:11:57Z) - Investigating VLM Hallucination from a Cognitive Psychology Perspective: A First Step Toward Interpretation with Intriguing Observations [60.63340688538124]
Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means that models tend to generate incorrect answers to align with user expectations. In this work, we introduce a psychological taxonomy, categorizing VLMs' cognitive biases that lead to hallucinations, including sycophancy, logical inconsistency, and a newly identified VLM behaviour: appeal to authority.
arXiv Detail & Related papers (2025-07-03T19:03:16Z) - Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers [76.42159902257677]
We argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR). OCR drives both generalization and hallucination, depending on whether the associated concepts are causally related. Our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
arXiv Detail & Related papers (2025-06-12T16:50:45Z) - Measuring Sycophancy of Language Models in Multi-turn Dialogues [15.487521707039772]
We introduce SYCON Bench, a novel benchmark for evaluating sycophancy in multi-turn, free-form conversational settings. Applying SYCON Bench to 17 Large Language Models across three real-world scenarios, we find that sycophancy remains a prevalent failure mode.
arXiv Detail & Related papers (2025-05-28T14:05:46Z) - The Mirage of Multimodality: Where Truth is Tested and Honesty Unravels [22.497467057872377]
This study is the first systematic investigation of distortions associated with System I and System II reasoning in multimodal contexts. We demonstrate that slower reasoning models, when presented with incomplete or misleading visual inputs, are more likely to fabricate plausible yet false details to support flawed reasoning.
arXiv Detail & Related papers (2025-05-26T16:55:38Z) - Aligned Probing: Relating Toxic Behavior and Model Internals [66.49887503194101]
We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs) with their internal representations. Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers.
arXiv Detail & Related papers (2025-03-17T17:23:50Z) - Sycophancy in Large Language Models: Causes and Mitigations [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks.
Their tendency to exhibit sycophantic behavior poses significant risks to their reliability and ethical deployment.
This paper provides a technical survey of sycophancy in LLMs, analyzing its causes, impacts, and potential mitigation strategies.
arXiv Detail & Related papers (2024-11-22T16:56:49Z) - Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework [18.54098084470481]
We analyze sycophancy across vision-language benchmarks and propose an inference-time mitigation framework. Our framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts.
arXiv Detail & Related papers (2024-08-21T01:03:21Z) - Toward A Causal Framework for Modeling Perception [22.596961524387233]
Perception remains understudied in machine learning (ML). We present a first approach to modeling perception causally. We define two kinds of probabilistic causal perception: structural perception and parametrical perception.
arXiv Detail & Related papers (2024-01-24T12:08:58Z) - When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour [0.8133739801185272]
We study the suggestibility of Large Language Models to sycophantic behaviour. This behaviour is known as sycophancy and describes the tendency of LLMs to generate misleading responses.
arXiv Detail & Related papers (2023-11-15T22:18:33Z) - Interpretable Imitation Learning with Dynamic Causal Relations [65.18456572421702]
We propose to expose captured knowledge in the form of a directed acyclic causal graph.
We also design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs.
The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner.
arXiv Detail & Related papers (2023-09-30T20:59:42Z) - Simple synthetic data reduces sycophancy in large language models [88.4435858554904]
We study the prevalence of sycophancy in language models.
Sycophancy is where models tailor their responses to follow a human user's view even when that view is not objectively correct.
arXiv Detail & Related papers (2023-08-07T23:48:36Z) - Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.