Related papers: Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

URL: http://arxiv.org/abs/2510.16727v1
Date: Sun, 19 Oct 2025 06:36:57 GMT
Title: Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models
Authors: Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal
Abstract summary: Large language models internalize a structural trade-off between truthfulness and obsequious flattery. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

Related papers

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling [49.41422138354821]
We propose a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process. We show that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
arXiv Detail & Related papers (2026-02-11T08:14:11Z)
BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models [7.174865411448373]
We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space.
arXiv Detail & Related papers (2026-01-05T14:22:20Z)
The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss [53.542743390809356]
This paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB). Our analysis reveals a fundamental paradigm paradox: the more deterministic and structured the time series, the more severe the bias induced by the point-wise loss function. We present a concrete solution that simultaneously achieves both principles via DFT or DWT.
arXiv Detail & Related papers (2025-12-21T06:08:22Z)
Drift No More? Context Equilibria in Multi-Turn LLM Interactions [58.69551510148673]
Context drift is the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. We show that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay.
arXiv Detail & Related papers (2025-10-09T04:48:49Z)
Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA [36.21980066799023]
Sycophancy is the tendency to align with user beliefs regardless of correctness. Despite its importance, sycophancy remains underexamined in factual question answering contexts. We introduce a unified evaluation framework to quantify the impact of sycophantic context on model behavior.
arXiv Detail & Related papers (2025-08-19T11:30:52Z)
KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models [1.649505438157608]
Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. We propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality.
arXiv Detail & Related papers (2025-07-26T14:24:19Z)
MIST: Towards Multi-dimensional Implicit BiaS Evaluation of LLMs via Theory of Mind [27.209638457499427]
Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states. We propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality.
arXiv Detail & Related papers (2025-06-17T03:50:57Z)
Consistent World Models via Foresight Diffusion [56.45012929930605]
We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability. We propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising.
arXiv Detail & Related papers (2025-05-22T10:01:59Z)
Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings. We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework [18.54098084470481]
We analyze sycophancy across vision-language benchmarks and propose an inference-time mitigation framework. Our framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts.
arXiv Detail & Related papers (2024-08-21T01:03:21Z)
Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. We demonstrate the broad applicability of this approach by adding it to both basic data-re (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
Self-supervised debiasing using low rank regularization [59.84695042540525]
Spurious correlations can cause strong biases in deep neural networks, impairing generalization ability. We propose a self-supervised debiasing framework potentially compatible with unlabeled samples. Remarkably, the proposed debiasing framework significantly improves the generalization performance of self-supervised learning baselines.
arXiv Detail & Related papers (2022-10-11T08:26:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.