Ask don't tell: Reducing sycophancy in large language models
- URL: http://arxiv.org/abs/2602.23971v1
- Date: Fri, 27 Feb 2026 12:27:04 GMT
- Title: Ask don't tell: Reducing sycophancy in large language models
- Authors: Magda Dubois, Cozmin Ududec, Christopher Summerfield, Lennart Luettgau
- Abstract summary: We show that sycophancy is substantially higher in response to non-questions compared to questions. We find that asking a model to convert non-questions into questions before answering significantly reduces sycophancy.
- Score: 1.5701458173528275
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies in which we first isolate how input framing influences sycophancy and then leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions with various non-questions, varying three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs. user-perspective), and affirmation vs. negation. We show that (1) sycophancy is substantially higher in response to non-questions than to questions. Additionally, we find that (2) sycophancy increases monotonically with the epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than that of a simple baseline prompt asking models "not to be sycophantic". Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.
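The described mitigation is simple to reproduce. Below is a minimal sketch of the two-step "ask, don't tell" pipeline, assuming the OpenAI Python client as one concrete backend (any chat-completion API would work); the rephrasing instruction, model name, and `llm` helper are illustrative assumptions, not the authors' exact prompts or released code.

```python
# Minimal sketch of the "ask, don't tell" input-level mitigation:
# convert a non-question (statement / belief / conviction) into a
# neutral question, then answer the question instead of the original
# framing. Prompt wording below is illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm(prompt: str) -> str:
    """Single-turn helper around a chat-completion backend."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

REPHRASE = (
    "Rewrite the user's message as a neutral question about whether the "
    "underlying claim is true. If it is already a question, return it "
    "unchanged. Return only the question.\n\nUser message: "
)

def ask_dont_tell(user_message: str) -> str:
    question = llm(REPHRASE + user_message)  # step 1: non-question -> question
    return llm(question)                     # step 2: answer the neutral question

def baseline(user_message: str) -> str:
    # The weaker baseline the paper compares against: a direct instruction.
    return llm("Do not be sycophantic.\n\n" + user_message)
```

For example, a conviction framing such as "I'm convinced the Great Wall of China is visible from space" would first be rewritten to something like "Is the Great Wall of China visible from space?", which, per the paper's findings, elicits a less user-affirming answer than responding to the conviction directly.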
Related papers
- When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models [11.001042171551566]
We study how user opinions induce sycophancy across different model families. First-person prompts consistently induce higher sycophancy rates than third-person framings. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers.
arXiv Detail & Related papers (2025-08-04T05:55:06Z) - Measuring Sycophancy of Language Models in Multi-turn Dialogues [33.875038658886986]
We introduce SYCON Bench, a novel benchmark for evaluating sycophancy in multi-turn, free-form conversational settings. Applying SYCON Bench to 17 Large Language Models across three real-world scenarios, we find that sycophancy remains a prevalent failure mode.
arXiv Detail & Related papers (2025-05-28T14:05:46Z) - Sycophancy in Large Language Models: Causes and Mitigations [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks.
Their tendency to exhibit sycophantic behavior poses significant risks to their reliability and ethical deployment.
This paper provides a technical survey of sycophancy in LLMs, analyzing its causes, impacts, and potential mitigation strategies.
arXiv Detail & Related papers (2024-11-22T16:56:49Z) - Accounting for Sycophancy in Language Model Uncertainty Estimation [28.08509288774144]
We study the relationship between sycophancy and uncertainty estimation for the first time.
We show that user confidence plays a critical role in modulating the effects of sycophancy.
We argue that externalizing both model and user uncertainty can help to mitigate the impacts of sycophancy bias.
arXiv Detail & Related papers (2024-10-17T18:00:25Z) - Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework [18.54098084470481]
We analyze sycophancy across vision-language benchmarks and propose an inference-time mitigation framework. Our framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts.
arXiv Detail & Related papers (2024-08-21T01:03:21Z) - Towards Understanding Sycophancy in Language Models [49.352840825419236]
We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback. We show that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. Our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
arXiv Detail & Related papers (2023-10-20T14:46:48Z) - Simple synthetic data reduces sycophancy in large language models [88.4435858554904]
We study the prevalence of sycophancy in language models.
Sycophancy is where models tailor their responses to follow a human user's view even when that view is not objectively correct.
arXiv Detail & Related papers (2023-08-07T23:48:36Z) - Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA [111.41719652451701]
We first model a confounding effect that causes language and vision bias simultaneously.
We then propose a counterfactual inference to remove the influence of this effect.
The proposed method outperforms the state-of-the-art methods in VQA-CP v2 datasets.
arXiv Detail & Related papers (2023-05-31T09:02:58Z) - Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z) - Neural Causal Models for Counterfactual Identification and Estimation [62.30444687707919]
We study the evaluation of counterfactual statements through neural models.
First, we show that neural causal models (NCMs) are expressive enough.
Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions.
arXiv Detail & Related papers (2022-09-30T18:29:09Z) - Nested Counterfactual Identification from Arbitrary Surrogate Experiments [95.48089725859298]
We study the identification of nested counterfactuals from an arbitrary combination of observations and experiments.
Specifically, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones.
arXiv Detail & Related papers (2021-07-07T12:51:04Z) - Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples.
Yet, its passive nature inevitably prevents it from being immune to unknown attackers.
We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)