Disentangling Deception and Hallucination Failures in LLMs
- URL: http://arxiv.org/abs/2602.14529v1
- Date: Mon, 16 Feb 2026 07:36:49 GMT
- Title: Disentangling Deception and Hallucination Failures in LLMs
- Authors: Haolang Lu, Hongrui Peng, WeiYe Fu, Guoshun Nan, Xinye Cao, Xingrui Li, Hongcan Guo, Kun Wang,
- Abstract summary: We propose an internal, mechanism-oriented perspective that separates Knowledge Existence from Behavior Expression. Hallucination and deception correspond to two qualitatively different failure modes that may appear similar at the output level but differ in their underlying mechanisms. We analyze these failure modes through representation separability, sparse interpretability, and inference-time activation steering.
- Score: 7.906722750233381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Failures in large language models (LLMs) are often analyzed from a behavioral perspective, where incorrect outputs in factual question answering are commonly associated with missing knowledge. In this work, focusing on entity-based factual queries, we suggest that such a view may conflate different failure mechanisms, and propose an internal, mechanism-oriented perspective that separates Knowledge Existence from Behavior Expression. Under this formulation, hallucination and deception correspond to two qualitatively different failure modes that may appear similar at the output level but differ in their underlying mechanisms. To study this distinction, we construct a controlled environment for entity-centric factual questions in which knowledge is preserved while behavioral expression is selectively altered, enabling systematic analysis of four behavioral cases. We analyze these failure modes through representation separability, sparse interpretability, and inference-time activation steering.
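The abstract names inference-time activation steering as one of its analysis tools. Below is a minimal, hedged sketch of that generic technique: adding a fixed direction to one layer's hidden states during generation. The model name, layer index, steering strength, and the random stand-in direction are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of inference-time activation steering (generic technique;
# not the paper's exact procedure). Model, layer, strength, and direction
# are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder causal LM with accessible transformer blocks
LAYER = 6        # hypothetical layer to steer
ALPHA = 4.0      # steering strength; the sign would flip "express" vs. "suppress"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# Hypothetical steering direction, e.g. a difference of mean activations
# between contrasting behaviors; here a random unit vector stands in.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the scaled direction at every token position.
    hidden_states = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
try:
    ids = tok("The capital of Australia is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=10, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore unsteered behavior
```

Comparing generations with and without the hook (or with opposite signs of ALPHA) is the usual way such steering is used to probe whether a behavior, rather than the underlying knowledge, is being modulated.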
Related papers
- Causality is Key for Interpretability Claims to Generalise [35.833847356014154]
Interpretability research on large language models (LLMs) has yielded important insights into model behaviour. Recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Pearl's causal hierarchy clarifies what an interpretability study can justify.
arXiv Detail & Related papers (2026-02-18T18:45:04Z) - Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification [27.02252748004729]
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. They frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. Evidential Uncertainty Quantification (EUQ) captures both information conflict and ignorance for effective detection of LVLM misbehaviors. (A hedged sketch of the standard vacuity and dissonance quantities behind this kind of evidential uncertainty appears after this list.)
arXiv Detail & Related papers (2026-02-05T10:51:39Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process [66.38541693477181]
We propose an unsupervised framework for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps', we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space.
arXiv Detail & Related papers (2025-12-30T05:09:11Z) - Cognitive Foundations for Reasoning and Their Manifestation in LLMs [63.12951576410617]
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. We synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning & knowledge, and transformation operations. We develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems.
arXiv Detail & Related papers (2025-11-20T18:59:00Z) - Self-Correcting Large Language Models: Generation vs. Multiple Choice [29.697851249014192]
Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement. We compare performance trends and error-correction behaviors across various natural language understanding and reasoning tasks. Our findings highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space.
arXiv Detail & Related papers (2025-11-12T14:46:40Z) - Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models [4.946483489399819]
Large Language Models (LLMs) are prone to hallucination, the generation of factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions.
arXiv Detail & Related papers (2025-10-07T16:40:31Z) - Investigating VLM Hallucination from a Cognitive Psychology Perspective: A First Step Toward Interpretation with Intriguing Observations [60.63340688538124]
Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter refers to models generating incorrect answers to align with user expectations. In this work, we introduce a psychological taxonomy categorizing VLMs' cognitive biases that lead to hallucinations, including sycophancy, logical inconsistency, and a newly identified VLM behaviour: appeal to authority.
arXiv Detail & Related papers (2025-07-03T19:03:16Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the interaction between world knowledge and logical reasoning. We find that state-of-the-art large language models (LLMs) often rely on superficial generalizations. We show that simple reformulations of the task can elicit more robust reasoning behavior.
arXiv Detail & Related papers (2024-10-31T12:48:58Z) - Self-correction is Not An Innate Capability in Large Language Models [13.268938380591765]
We investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? We show that moral self-correction is not an inherent capability of LLMs, as they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.
arXiv Detail & Related papers (2024-10-27T16:52:21Z) - Feedback in Imitation Learning: Confusion on Causality and Covariate Shift [12.93527098342393]
We argue that conditioning policies on previous actions leads to a dramatic divergence between "held out" error and performance of the learner in situ. We analyze existing benchmarks used to test imitation learning approaches. We find, in a surprising contrast with previous literature, that naive behavioral cloning provides excellent results.
arXiv Detail & Related papers (2021-02-04T20:18:56Z)
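One of the entries above, "Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification," separates information conflict from ignorance. As a hedged illustration, the sketch below computes the standard subjective-logic vacuity (ignorance) and dissonance (conflict) of a Dirichlet evidence vector. The evidence values, the helper name, and the mapping onto that paper's exact EUQ formulation are assumptions for illustration only.

```python
# Minimal sketch of standard evidential uncertainty quantities; the cited
# paper's exact formulation may differ.
import numpy as np

def evidential_uncertainty(evidence: np.ndarray):
    """evidence: non-negative per-class evidence (e.g. from an evidential head).
    Returns (vacuity, dissonance, per-class belief masses)."""
    k = evidence.shape[-1]
    alpha = evidence + 1.0                  # Dirichlet parameters
    strength = alpha.sum(-1)                # total Dirichlet strength S
    belief = evidence / strength            # belief mass per class
    vacuity = k / strength                  # "ignorance": little total evidence

    # Dissonance: belief spread across mutually conflicting classes.
    diss = 0.0
    for i in range(k):
        others = np.delete(belief, i)
        denom = others.sum()
        if denom > 0:
            bal = 1.0 - np.abs(others - belief[i]) / (others + belief[i] + 1e-12)
            diss += belief[i] * (others * bal).sum() / denom
    return vacuity, diss, belief

# Strong but conflicting evidence -> low vacuity, high dissonance.
print(evidential_uncertainty(np.array([40.0, 38.0, 1.0])))
# Almost no evidence -> high vacuity, low dissonance.
print(evidential_uncertainty(np.array([0.2, 0.1, 0.1])))
```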