Related papers: Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA

URL: http://arxiv.org/abs/2508.13743v1
Date: Tue, 19 Aug 2025 11:30:52 GMT
Title: Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA
Authors: Kaiwei Zhang, Qi Jia, Zijian Chen, Wei Sun, Xiangyang Zhu, Chunyi Li, Dandan Zhu, Guangtao Zhai
Abstract summary: Sycophancy is the tendency to align with user beliefs regardless of correctness. Despite its importance, sycophancy remains underexamined in factual question answering contexts. We introduce a unified evaluation framework to quantify the impact of sycophantic context on model behavior.
Score: 36.21980066799023
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs), while increasingly used in domains requiring factual rigor, often display a troubling behavior: sycophancy, the tendency to align with user beliefs regardless of correctness. This tendency is reinforced by preference-based alignment techniques that optimize for user satisfaction but can undermine truthfulness. While relatively benign in casual dialogue, sycophancy poses serious risks in high-stakes settings such as scientific question answering (QA), where model outputs may shape collaborative reasoning, decision-making, and knowledge formation. Despite its importance, this phenomenon remains underexamined in factual QA contexts. We address this gap by introducing a unified evaluation framework to quantify the impact of sycophantic context on model behavior in scientific QA, measuring how much user-imposed social pressure distorts model outputs. The framework incorporates adversarial prompting setups and targeted metrics, such as misleading resistance and sycophancy resistance, that capture a model's ability to maintain factual consistency under misleading cues. Systematic evaluations across open-source and proprietary models reveal pervasive sycophantic tendencies, driven more by alignment strategy than by model size. To mitigate this issue, we propose Pressure-Tune, a lightweight post-training method that fine-tunes models on synthetic adversarial dialogues paired with chain-of-thought rationales. These rationales reject user misinformation while reinforcing factual commitments. Experiments on challenging scientific QA benchmarks show that Pressure-Tune significantly enhances sycophancy resistance without compromising accuracy or responsiveness to valid feedback, offering a practical pathway toward more truthful and principled model behavior.

Related papers

The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models [0.0]
Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress. We introduce the Drill-Down Fabricate Test (DDFT), a protocol that measures robustness. We find flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance.
arXiv Detail & Related papers (2025-12-29T20:29:09Z)
Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs [0.0]
PARROT (Persuasion and Agreement Robustness Rating of Output Truth) is a robustness-focused framework designed to measure the degradation in accuracy under social pressure exerted on users. We evaluate 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates.
arXiv Detail & Related papers (2025-11-21T13:01:28Z)
Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models [0.0]
Large language models internalize a structural trade-off between truthfulness and obsequious flattery. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context.
arXiv Detail & Related papers (2025-10-19T06:36:57Z)
DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios [57.327907850766785]
Characterization of deception across realistic real-world scenarios remains underexplored. We establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different domains. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. We incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics.
arXiv Detail & Related papers (2025-10-17T10:14:26Z)
AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce Sycophancy Mitigation through Adaptive Reasoning Trajectories (SMART). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
On the Reasoning Capacity of AI Models and How to Quantify It [0.0]
Large Language Models (LLMs) have intensified the debate surrounding the fundamental nature of their reasoning capabilities. While achieving high performance on benchmarks such as GPQA and MMLU, these models exhibit limitations in more complex reasoning tasks. We propose a novel phenomenological approach that goes beyond traditional accuracy metrics to probe the underlying mechanisms of model behavior.
arXiv Detail & Related papers (2025-01-23T16:58:18Z)
Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference [22.65765161695905]
Existing full-reference image quality assessment (FR-IQA) methods often fail to capture the complex causal mechanisms that underlie human perceptual responses to image distortions. We propose an FR-IQA method based on abductive counterfactual inference to investigate the causal relationships between deep network features and perceptual distortions.
arXiv Detail & Related papers (2024-12-22T09:17:57Z)
Sycophancy in Large Language Models: Causes and Mitigations [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. Their tendency to exhibit sycophantic behavior poses significant risks to their reliability and ethical deployment. This paper provides a technical survey of sycophancy in LLMs, analyzing its causes, impacts, and potential mitigation strategies.
arXiv Detail & Related papers (2024-11-22T16:56:49Z)
Accounting for Sycophancy in Language Model Uncertainty Estimation [28.08509288774144]
We study the relationship between sycophancy and uncertainty estimation for the first time. We show that user confidence plays a critical role in modulating the effects of sycophancy. We argue that externalizing both model and user uncertainty can help to mitigate the impacts of sycophancy bias.
arXiv Detail & Related papers (2024-10-17T18:00:25Z)
Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework [18.54098084470481]
We analyze sycophancy across vision-language benchmarks and propose an inference-time mitigation framework. Our framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts.
arXiv Detail & Related papers (2024-08-21T01:03:21Z)
Extreme Miscalibration and the Illusion of Adversarial Robustness [66.29268991629085]
Adversarial training is often used to increase model robustness. We show that this observed gain in robustness is an illusion of robustness (IOR). We urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations.
arXiv Detail & Related papers (2024-02-27T13:49:12Z)
Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks. The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data. Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z)
Non-Singular Adversarial Robustness of Neural Networks [58.731070632586594]
Adversarial robustness has become an emerging challenge for neural networks owing to their over-sensitivity to small input perturbations. We formalize the notion of non-singular adversarial robustness for neural networks through the lens of joint perturbations to data inputs as well as model weights.
arXiv Detail & Related papers (2021-02-23T20:59:30Z)
Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning. These measures should account for the wide variety of models used in practice. The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.