The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models
- URL: http://arxiv.org/abs/2510.16712v2
- Date: Sun, 26 Oct 2025 23:59:21 GMT
- Title: The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models
- Authors: Shivam Ratnakar, Sanjay Raghavendra
- Abstract summary: We present the first systematic investigation of "chameleon behavior" in Large Language Models. We expose fundamental flaws in state-of-the-art systems. Our analysis uncovers the mechanism: source re-use rate correlates strongly and significantly with model confidence and stance changes.
- Score: 1.4323566945483497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of Large Language Models (LLMs) with search and retrieval engines has become ubiquitous, yet these systems harbor a critical vulnerability that undermines their reliability. We present the first systematic investigation of "chameleon behavior" in LLMs: their alarming tendency to shift stances when presented with contradictory questions in multi-turn conversations, especially in search-enabled LLMs. Through our novel Chameleon Benchmark Dataset, comprising 17,770 carefully crafted question-answer pairs across 1,180 multi-turn conversations spanning 12 controversial domains, we expose fundamental flaws in state-of-the-art systems. We introduce two theoretically grounded metrics: the Chameleon Score (0-1), which quantifies stance instability, and the Source Re-use Rate (0-1), which measures knowledge diversity. Our rigorous evaluation of Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent failures: all models exhibit severe chameleon behavior (scores 0.391-0.511), with GPT-4o-mini showing the worst performance. Crucially, the small across-temperature variance (< 0.004) suggests the effect is not a sampling artifact. Our analysis uncovers the mechanism: correlations between source re-use rate and both confidence (r = 0.627) and stance changes (r = 0.429) are statistically significant (p < 0.05), indicating that limited knowledge diversity makes models pathologically deferential to query framing. These findings highlight the need for comprehensive consistency evaluation before deploying LLMs in healthcare, legal, and financial systems, where maintaining coherent positions across interactions is critical for reliable decision support.
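The abstract does not reproduce the metric definitions, so the following is a minimal sketch under assumed definitions: a Chameleon Score computed as the fraction of adjacent-turn stance flips, and a Source Re-use Rate computed as one minus the share of unique cited sources. The function names, the toy data, and both formulas are illustrative assumptions, not the paper's actual method; the final step merely mirrors the kind of Pearson correlation check the abstract reports.

```python
# Minimal sketch; all metric definitions and data here are assumptions.
from scipy.stats import pearsonr  # correlation test, as in the abstract's analysis


def chameleon_score(stances: list[str]) -> float:
    """Assumed definition: fraction of adjacent-turn stance flips, in [0, 1]."""
    if len(stances) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(stances, stances[1:]))
    return flips / (len(stances) - 1)


def source_reuse_rate(cited_sources: list[str]) -> float:
    """Assumed definition: 1 - (unique sources / total citations), in [0, 1]."""
    if not cited_sources:
        return 0.0
    return 1.0 - len(set(cited_sources)) / len(cited_sources)


# Hypothetical per-conversation data: (stance per turn, source cited per turn).
conversations = [
    (["pro", "con", "con", "pro"], ["a", "a", "b", "a"]),
    (["pro", "pro", "pro", "pro"], ["a", "b", "c", "d"]),
    (["con", "pro", "con", "pro"], ["a", "a", "a", "a"]),
]

scores = [chameleon_score(stances) for stances, _ in conversations]
reuse = [source_reuse_rate(sources) for _, sources in conversations]

# Mirrors the correlational check reported in the abstract (r = 0.429 between
# source re-use and stance changes); three toy points are only a demo.
r, p = pearsonr(reuse, scores)
print(f"chameleon scores: {scores}")
print(f"source re-use:    {reuse}")
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```

Under these assumed definitions, both quantities fall in [0, 1], matching the ranges stated in the abstract.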
Related papers
- Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions [8.026492468995187]
Small models exhibit extreme compliance, with over 80% of belief changes occurring at the first persuasive turn. Meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. These findings highlight substantial model-dependent limits of current robustness interventions.
arXiv Detail & Related papers (2026-01-20T04:43:55Z) - Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs [0.0]
PARROT (Persuasion and Agreement Robustness Rating of Output Truth) is a robustness-focused framework designed to measure the degradation in accuracy under social pressure from users. We evaluate 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates.
arXiv Detail & Related papers (2025-11-21T13:01:28Z) - Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z) - Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs [9.291589998223696]
We introduce MedQA-Followup, a framework for evaluating multi-turn robustness in medical question answering. Using controlled interventions on the MedQA dataset, we evaluate five state-of-the-art LLMs. We find that while models perform reasonably well under shallow perturbations, they exhibit severe vulnerabilities in multi-turn settings.
arXiv Detail & Related papers (2025-10-14T08:04:18Z) - Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency [10.052307738781678]
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations: generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a framework that leverages FINe-grained Cross-model consistency to detect and mitigate Hallucinations in LLM outputs without requiring external knowledge sources.
arXiv Detail & Related papers (2025-08-19T23:45:34Z) - Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History [7.58175460763641]
Even 400B+ models exhibit substantial response variability. Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed persona instructions, and inclusion of conversation history, can paradoxically increase variability. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
arXiv Detail & Related papers (2025-08-06T19:11:33Z) - Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs [87.79350168490475]
This work formally defines self-consistent errors and evaluates mainstream detection methods on them. All four types of detection methods significantly struggle to detect self-consistent errors. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe.
arXiv Detail & Related papers (2025-05-23T09:18:56Z) - Seeing What's Not There: Spurious Correlation in Multimodal LLMs [47.651861502104715]
We introduce SpurLens, a pipeline that automatically identifies spurious visual cues without human supervision. Our findings reveal that spurious correlations cause two major failure modes in Multimodal Large Language Models (MLLMs). By exposing the persistence of spurious correlations, our study calls for more rigorous evaluation methods and mitigation strategies to enhance the reliability of MLLMs.
arXiv Detail & Related papers (2025-03-11T20:53:00Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models [45.63440666848143]
Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. We study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs.
arXiv Detail & Related papers (2025-01-31T10:37:48Z) - Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models [8.846200844870767]
We discover an understudied type of undesirable behavior of Large Language Models (LLMs), which we term Verbosity Compensation (VC), similar to the hesitation behavior of humans under uncertainty. We propose a simple yet effective cascade algorithm that replaces verbose responses with responses generated by other models.
arXiv Detail & Related papers (2024-11-12T15:15:20Z) - "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using: (i) hallucination rate, the model's tendency to hallucinate an answer when none is present in the passages of the non-relevant subset, and (ii) error rate, the model's failure to recognize relevant passages in the relevant subset.
arXiv Detail & Related papers (2023-12-18T17:18:04Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)