ELEPHANT: Measuring and understanding social sycophancy in LLMs
- URL: http://arxiv.org/abs/2505.13995v2
- Date: Mon, 29 Sep 2025 21:29:38 GMT
- Title: ELEPHANT: Measuring and understanding social sycophancy in LLMs
- Authors: Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, Dan Jurafsky
- Abstract summary: We introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy.
- Score: 31.88430788417527
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs are known to exhibit sycophancy: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face (their desired self-image), and present ELEPHANT, a benchmark for measuring social sycophancy in an LLM. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve users' face 45 percentage points more than humans in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when prompted with perspectives from either side of a moral conflict, LLMs affirm both sides (depending on whichever side the user adopts) in 48% of cases, telling both the at-fault party and the wronged party that they are not wrong, rather than adhering to a consistent moral or value judgment. We further show that social sycophancy is rewarded in preference datasets, and that while existing mitigation strategies for sycophancy are limited in effectiveness, model-based steering shows promise for mitigating these behaviors. Our work provides theoretical grounding and an empirical benchmark for understanding and addressing sycophancy in the open-ended contexts that characterize the vast majority of LLM use cases.
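To make the headline metric concrete, here is a minimal sketch of how a face-preservation gap of this kind could be computed with an LLM-as-judge style classifier. The data fields and the `judge_preserves_face` callable are hypothetical stand-ins for illustration, not the ELEPHANT authors' actual prompts or code.

```python
# Minimal sketch of scoring social sycophancy as a face-preservation gap.
# All names here (AdviceQuery, judge_preserves_face) are hypothetical
# illustrations, not the ELEPHANT authors' actual code or prompts.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class AdviceQuery:
    prompt: str          # e.g., an advice question or r/AmITheAsshole-style post
    llm_response: str    # response from the model under evaluation
    human_response: str  # human-written reply used as the baseline

def face_preservation_gap(
    queries: Sequence[AdviceQuery],
    judge_preserves_face: Callable[[str, str], bool],
) -> float:
    """Percentage points by which the LLM preserves the user's face
    (affirms their desired self-image) more often than humans do."""
    llm_rate = sum(judge_preserves_face(q.prompt, q.llm_response) for q in queries) / len(queries)
    human_rate = sum(judge_preserves_face(q.prompt, q.human_response) for q in queries) / len(queries)
    return 100.0 * (llm_rate - human_rate)
```

Read this way, the abstract's 45-percentage-point figure corresponds to the value such a function would return under the paper's own judge and datasets.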
Related papers
- Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models [2.1700203922407493]
We propose a novel way to evaluate the sycophancy of LLMs directly and neutrally. A key novelty is the use of LLM-as-a-judge and the evaluation of sycophancy as a zero-sum game in a betting setting.
arXiv Detail & Related papers (2026-01-21T20:00:14Z)
- Are We on the Right Way to Assessing LLM-as-a-Judge? [16.32248269615178]
We introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency and global logical consistency. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings.
arXiv Detail & Related papers (2025-12-17T23:49:55Z)
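The two consistency lenses above suggest simple mechanical checks. As an illustration, the sketch below estimates how often a pairwise judge violates transitivity, one concrete reading of global logical consistency; the `judge` callable is a hypothetical stand-in for however Sage actually queries the model.

```python
# Sketch: estimating how often a pairwise LLM judge violates transitivity.
# judge(a, b) -> True iff the judge prefers answer a over answer b; it is
# a hypothetical stand-in for however Sage actually queries the model.
from itertools import permutations
from typing import Callable, Sequence

def transitivity_violation_rate(
    answers: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of ordered answer triples (a, b, c) where the judge
    prefers a over b and b over c, yet fails to prefer a over c."""
    triples = list(permutations(answers, 3))
    violations = sum(
        1 for a, b, c in triples
        if judge(a, b) and judge(b, c) and not judge(a, c)
    )
    return violations / len(triples) if triples else 0.0
```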
- Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare [87.06241096619112]
Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. We introduce the Social Welfare Function Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator. We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation.
arXiv Detail & Related papers (2025-10-01T17:52:31Z)
- BASIL: Bayesian Assessment of Sycophancy in LLMs [26.346357679861228]
Sycophancy is critical to understand in the context of human-AI collaboration. Existing methods for studying sycophancy in LLMs are either descriptive (studying how behavior changes when sycophancy is elicited) or normative. We introduce a Bayesian framework to study the normative effects of sycophancy on rationality in LLMs.
arXiv Detail & Related papers (2025-08-23T00:11:00Z)
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z)
- When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models [11.001042171551566]
We study how user opinions induce sycophancy across different model families. First-person prompts consistently induce higher sycophancy rates than third-person framings. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers.
arXiv Detail & Related papers (2025-08-04T05:55:06Z)
- Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models [57.834711966432685]
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. We introduce the Bullshit Index, a novel metric quantifying a large language model's indifference to truth. We observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy.
arXiv Detail & Related papers (2025-07-10T07:11:57Z)
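One plausible way to operationalize "indifference to truth" is to check whether what a model asserts tracks what it reports believing. The sketch below scores this as one minus the absolute correlation between elicited belief probabilities and binary assertions; this particular formula is an assumption for illustration and may differ from the paper's exact definition.

```python
# Sketch: a bullshit-index-style score as one minus the absolute
# correlation between a model's elicited belief that each claim is true
# (a probability in [0, 1]) and whether it asserts the claim (0 or 1).
# Assertions that track beliefs score near 0; assertions made regardless
# of belief score near 1. Requires Python 3.10+ and non-constant inputs.
import statistics

def bullshit_index(beliefs: list[float], asserted: list[int]) -> float:
    corr = statistics.correlation(beliefs, [float(a) for a in asserted])
    return 1.0 - abs(corr)
```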
- SocialEval: Evaluating Social Intelligence of Large Language Models [70.90981021629021]
Social Intelligence (SI) equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This motivates an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation. We propose SocialEval, a script-based bilingual SI benchmark that integrates outcome- and process-oriented evaluation through manually crafted narrative scripts.
arXiv Detail & Related papers (2025-06-01T08:36:51Z)
- Measuring Sycophancy of Language Models in Multi-turn Dialogues [15.487521707039772]
We introduce SYCON Bench, a novel benchmark for evaluating sycophancy in multi-turn, free-form conversational settings. Applying SYCON Bench to 17 Large Language Models across three real-world scenarios, we find that sycophancy remains a prevalent failure mode.
arXiv Detail & Related papers (2025-05-28T14:05:46Z)
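In multi-turn settings, a natural statistic is how quickly a model abandons its initial stance once the user pushes back. The sketch below computes the turn of the first flip; the `same_stance` comparator is a hypothetical stand-in for whatever stance classifier a benchmark like SYCON Bench employs.

```python
# Sketch: measuring how quickly a model flips its stance under pushback.
# stances[t] is the model's stance after t rounds of user disagreement;
# same_stance is a hypothetical classifier comparing two stances.
from typing import Callable, Optional, Sequence

def turn_of_first_flip(
    stances: Sequence[str],
    same_stance: Callable[[str, str], bool],
) -> Optional[int]:
    """1-indexed turn at which the model first abandons its initial
    stance, or None if it holds the stance for the whole dialogue."""
    initial = stances[0]
    for turn, stance in enumerate(stances[1:], start=1):
        if not same_stance(initial, stance):
            return turn
    return None
```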
- Arbiters of Ambivalence: Challenges of Using LLMs in No-Consensus Tasks [52.098988739649705]
This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. We develop a "no-consensus" benchmark by curating examples that encompass a variety of a priori ambivalent scenarios. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters.
arXiv Detail & Related papers (2025-05-28T01:31:54Z)
- Language Models Surface the Unwritten Code of Science and Society [1.4680035572775534]
This paper calls on the research community to investigate how human biases are inherited by large language models (LLMs). We introduce a conceptual framework through a case study in science: uncovering hidden rules in peer review.
arXiv Detail & Related papers (2025-05-25T02:28:40Z)
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning [52.32745233116143]
Humans organize knowledge into compact categories through semantic compression. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, but whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear.
arXiv Detail & Related papers (2025-05-21T16:29:00Z)
- The Traitors: Deception and Trust in Multi-Agent Language Model Simulations [0.0]
We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games. We develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry.
arXiv Detail & Related papers (2025-05-19T10:01:35Z)
- Going Whole Hog: A Philosophical Defense of AI Cognition [0.0]
We argue against prevailing methodologies in AI philosophy, rejecting starting points based on low-level computational details. We employ 'Holistic Network Assumptions' to argue for the full suite of cognitive states. We conclude by speculating on the possibility of LLMs possessing 'alien' contents beyond human conceptual schemes.
arXiv Detail & Related papers (2025-04-18T11:36:25Z)
- LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena [0.0]
We show that ethical refusals yield significantly lower win rates than both technical refusals and standard responses. Our findings underscore a core tension in LLM design: safety-aligned behaviors may conflict with user expectations.
arXiv Detail & Related papers (2025-01-04T06:36:44Z)
- Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs [44.56018149475948]
Sycophancy is a prevalent hallucination that poses significant challenges to visual language models (VLMs).
We propose a synthetic dataset for training and employ methods based on prompts, supervised fine-tuning, and DPO to mitigate sycophancy.
Our findings indicate that the ability to prevent sycophancy is predominantly observed in higher layers of the model.
arXiv Detail & Related papers (2024-10-15T05:48:14Z)
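The entry above mitigates sycophancy with prompt-based methods, SFT, and DPO. The DPO route relies on preference pairs in which a corrective answer is marked as chosen and the sycophantic one as rejected; below is a minimal sketch of assembling such pairs in the prompt/chosen/rejected layout used by common DPO trainers (e.g., TRL's DPOTrainer). The example records are invented for illustration.

```python
# Sketch: assembling DPO preference pairs that reward non-sycophantic
# answers. The prompt/chosen/rejected keys follow the convention used by
# common DPO trainers (e.g., TRL's DPOTrainer); the records themselves
# are invented examples, not the paper's training data.
import json

def make_dpo_pair(prompt: str, corrective: str, sycophantic: str) -> dict:
    return {"prompt": prompt, "chosen": corrective, "rejected": sycophantic}

pairs = [
    make_dpo_pair(
        prompt="I think the answer shown in the image is B, right?",
        corrective="Looking at the image, the answer is actually C, because ...",
        sycophantic="Yes, you're right, it's B!",
    ),
]

with open("dpo_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```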
- From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning [91.79567270986901]
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue. We propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective.
arXiv Detail & Related papers (2024-09-03T07:01:37Z)
- BeHonest: Benchmarking Honesty in Large Language Models [23.192389530727713]
We introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in Large Language Models.
BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses.
Our findings indicate that there is still significant room for improvement in the honesty of LLMs.
arXiv Detail & Related papers (2024-06-19T06:46:59Z)
- Should agentic conversational AI change how we think about ethics? Characterising an interactional ethics centred on respect [0.12041807591122715]
We propose an interactional approach to ethics that is centred on relational and situational factors.
Our work anticipates a set of largely unexplored risks at the level of situated social interaction.
arXiv Detail & Related papers (2024-01-17T09:44:03Z)
- Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims, which is not directly observable.
Existing susceptibility studies heavily rely on self-reported beliefs.
We propose a computational approach to model users' latent susceptibility levels.
arXiv Detail & Related papers (2023-11-16T07:22:56Z)
- Simple synthetic data reduces sycophancy in large language models [88.4435858554904]
We study the prevalence of sycophancy in language models.
Sycophancy occurs when models tailor their responses to follow a human user's view even when that view is not objectively correct.
arXiv Detail & Related papers (2023-08-07T23:48:36Z)
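The core trick in this line of work is to fine-tune on prompts where the user's stated opinion is sampled independently of the ground truth, so that agreeing with the user stops being predictive of the correct answer. A minimal sketch of generating such synthetic examples follows; the claims and prompt template are invented for illustration.

```python
# Sketch: synthetic fine-tuning examples in which the user's stated
# opinion is sampled independently of the ground-truth label, so that
# agreeing with the user stops being predictive of the correct answer.
# The claims and prompt template are invented for illustration.
import random

CLAIMS = [
    ("The Earth orbits the Sun.", "Agree"),
    ("The Great Wall of China is visible from the Moon.", "Disagree"),
]

def synthetic_example(claim: str, label: str) -> dict:
    opinion = random.choice(["agree", "disagree"])  # independent of label
    prompt = (
        f"I {opinion} with the following claim: {claim} "
        "Do you agree or disagree with the claim?"
    )
    return {"prompt": prompt, "target": label}  # target ignores the opinion
```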
- Flexible social inference facilitates targeted social learning when rewards are not observable [58.762004496858836]
Groups coordinate more effectively when individuals are able to learn from others' successes.
We suggest that social inference capacities may help bridge this gap, allowing individuals to update their beliefs about others' underlying knowledge and success from observable trajectories of behavior.
arXiv Detail & Related papers (2022-12-01T21:04:03Z)
- Aligning to Social Norms and Values in Interactive Narratives [89.82264844526333]
We focus on creating agents that act in alignment with socially beneficial norms and values in interactive narratives or text-based games.
We introduce the GALAD agent that uses the social commonsense knowledge present in specially trained language models to contextually restrict its action space to only those actions that are aligned with socially beneficial values.
arXiv Detail & Related papers (2022-05-04T09:54:33Z)
- COSMO: Conditional SEQ2SEQ-based Mixture Model for Zero-Shot Commonsense Question Answering [50.65816570279115]
Identifying the implicit causes and effects of a social context is a key capability for enabling machines to perform commonsense reasoning. Current approaches in this realm lack the ability to perform commonsense reasoning when facing an unseen situation. We present the Conditional SEQ2SEQ-based Mixture model (COSMO), which enables dynamic and diverse content generation.
arXiv Detail & Related papers (2020-11-02T07:08:19Z)