Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
- URL: http://arxiv.org/abs/2511.17220v1
- Date: Fri, 21 Nov 2025 13:01:28 GMT
- Title: Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
- Authors: Yusuf Çelebi, Mahmoud El Hussieni, Özay Ezerceli,
- Abstract summary: PARROT (Persuasion and Agreement Robustness Rating of Output Truth) is a robustness focused framework designed to measure the degradation in accuracy under social pressure exerted on users.<n>We evaluate 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs) the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low "follow rates" ($\leq 11\%$, GPT-5: 4\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\%, Qwen 2.5-1.5B: 94\%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of "resistance to overfitting pressure" should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
Related papers
- RAudit: A Blind Auditing Protocol for Large Language Model Reasoning [0.8594140167290097]
Inference-time scaling can amplify reasoning pathologies: sycophancy, rung collapse, and premature certainty.<n>We present RAudit, a diagnostic protocol for auditing LLM reasoning without ground truth access.
arXiv Detail & Related papers (2026-01-30T16:22:45Z) - AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains [3.721111684544962]
Hallucination in large language models (LLMs) contributes to spread of misinformation and diminished public trust.<n>We introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality.<n>We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates.
arXiv Detail & Related papers (2026-01-21T22:47:59Z) - Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning [32.32593439144886]
Behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification.<n>Our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation.
arXiv Detail & Related papers (2025-12-22T22:51:48Z) - CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal [84.71254539482369]
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has the failures.<n>We present CARE, a failure-centric post-training framework for multimodal reasoning that turns errors into supervision.<n> CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures.
arXiv Detail & Related papers (2025-12-22T16:34:21Z) - Compressed Causal Reasoning: Quantization and GraphRAG Effects on Interventional and Counterfactual Accuracy [0.0]
This study systematically evaluate quantization effects across all three levels of Pearls Causal Ladder.<n>We find that rung level accuracy in Llama 3 8B remains broadly stable under quantization, with NF4 showing less than one percent overall degradation.<n>Experiments on the CRASS benchmark show near identical performance across precisions, indicating that existing commonsense counterfactual datasets lack the structural sensitivity needed to reveal quantization induced reasoning drift.
arXiv Detail & Related papers (2025-12-13T17:54:15Z) - Benchmarking Corruption Robustness of LVLMs: A Discriminative Benchmark and Robustness Alignment Metric [49.393713730706445]
We introduce Bench-C, a benchmark emphasizing discriminative samples for assessing corruption robustness.<n>We propose the Robustness Alignment Score (RAS), a unified metric that measures degradation in logit-level prediction structure.
arXiv Detail & Related papers (2025-11-24T12:07:56Z) - The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models [1.4323566945483497]
We present the first systematic investigation of "chameleon behavior" in Large Language Models.<n>We expose fundamental flaws in state-of-the-art systems.<n>Our analysis uncovers the mechanism: strong correlations between source re-use rate and confidence are statistically significant.
arXiv Detail & Related papers (2025-10-19T04:51:14Z) - CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations.<n>ClUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z) - Causally-Enhanced Reinforcement Policy Optimization [36.523007244998695]
Causally-Enhanced Policy Optimization (CE-PO) is a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence.<n>CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback.<n> Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%), while improving robustness to correlation-causation flips and light counterfactual edits.
arXiv Detail & Related papers (2025-09-27T04:10:16Z) - ConfTuner: Training Large Language Models to Express Their Confidence Verbally [58.63318088243125]
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare.<n>LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence"
arXiv Detail & Related papers (2025-08-26T09:25:32Z) - Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA [36.21980066799023]
sycophancy is the tendency to align with user beliefs regardless of correctness.<n>Despite its importance, sycophancy remains underexamined in factual question answering contexts.<n>We introduce a unified evaluation framework to quantify the impact of sycophantic context on model behavior.
arXiv Detail & Related papers (2025-08-19T11:30:52Z) - The Confidence Paradox: Can LLM Know When It's Wrong [5.445980143646736]
Document Visual Question Answering (DocVQA) models often produce overconfident or ethically misaligned responses.<n>We propose HonestVQA, a model-agnostic, self-supervised framework that aligns model confidence with correctness using weighted loss and contrastive learning.
arXiv Detail & Related papers (2025-06-30T02:06:54Z) - Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models, including OpenAI's o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash.<n>Our evaluation reveals significant accuracy drops following gaslighting negation prompts.<n>We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate reasoning models' susceptibility to defend their belief.
arXiv Detail & Related papers (2025-06-11T12:52:25Z) - SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization [57.69385990442078]
Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions)<n>Semantic entropy measures the diversity of meaning in multiple generated answers given a prompt and uses this to modulate the magnitude of policy updates.
arXiv Detail & Related papers (2025-05-18T10:20:59Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data.<n>These findings highlight the reliance on recall over rigorous logical inference.<n>This paper introduces a novel benchmark, termed as Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency.<n>Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function.<n>We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths.
arXiv Detail & Related papers (2025-02-01T18:09:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.