Related papers: Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models


URL: http://arxiv.org/abs/2510.06107v2
Date: Wed, 08 Oct 2025 18:51:54 GMT
Title: Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models
Authors: Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona,
Abstract summary: Large Language Models (LLMs) are prone to hallucination, the generation of factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions.
Score: 4.946483489399819
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable the reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the model's layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret using the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate, contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation ($\rho = -0.863$) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.
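The abstract reports a correlation of $\rho = -0.863$ between contextual-pathway coherence and hallucination rates but does not say which coefficient is meant. Assuming $\rho$ denotes Spearman's rank correlation, the sketch below shows how such a coefficient is computed: rank both variables (averaging ranks over ties) and take the Pearson correlation of the ranks. The coherence scores and hallucination rates are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # convert 0-based positions to 1-based ranks
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values (NOT from the paper): one entry per probe set.
coherence = [0.9, 0.8, 0.7, 0.4, 0.2]
hallucination_rate = [0.05, 0.20, 0.10, 0.45, 0.60]
print(round(spearman(coherence, hallucination_rate), 3))  # -0.9
```

A value near -1 means higher coherence tends to go with lower hallucination rates. Rank correlation captures only monotone association; on its own it does not establish a causal link.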

Related papers

Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in fewer than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
Incentives or Ontology? A Structural Rebuttal to OpenAI's Hallucination Thesis [0.42970700836450487]
We argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules. We conclude that hallucination is a structural property of generative architectures.
arXiv Detail & Related papers (2025-12-16T17:39:45Z)
Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models [7.18947815679122]
Internal State Probing and Chain-of-Thought Verification are used to detect hallucinations in large language models. We develop a unified framework that bridges the gap between the two methods. Our framework consistently and significantly outperforms strong baselines.
arXiv Detail & Related papers (2025-10-13T15:31:21Z)
Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI [0.0]
A central question is whether undesirable behaviors like deception are localized functions that can be removed. By combining sparse autoencoders, targeted ablation, and adversarial training, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient.
arXiv Detail & Related papers (2025-09-23T23:16:11Z)
How Large Language Models are Designed to Hallucinate [0.42970700836450487]
We argue that hallucination is a structural outcome of the transformer architecture. Our contribution is threefold: (1) a comparative account showing why existing explanations are insufficient; (2) a predictive taxonomy of hallucination linked to existential structures with proposed benchmarks; and (3) design directions toward "truth-constrained" architectures capable of withholding or deferring when disclosure is absent.
arXiv Detail & Related papers (2025-09-19T16:46:27Z)
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models [11.001042171551566]
We study how user opinions induce sycophancy across different model families. First-person prompts consistently induce higher sycophancy rates than third-person framings. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers.
arXiv Detail & Related papers (2025-08-04T05:55:06Z)
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers [76.42159902257677]
We argue that both behaviors, generalization and hallucination, stem from a single mechanism known as out-of-context reasoning (OCR). OCR drives both generalization and hallucination, depending on whether the associated concepts are causally related. Our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
arXiv Detail & Related papers (2025-06-12T16:50:45Z)
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding [75.57997630182136]
We investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in Large Multimodal Models with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. We propose a training-free semantic hallucination mitigation framework comprising two key components: ZoomText and Grounded Layer Correction. Our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
arXiv Detail & Related papers (2025-06-05T19:53:19Z)
How do Transformers Learn Implicit Reasoning? [67.02072851088637]
We study how implicit multi-hop reasoning emerges by training transformers from scratch in a controlled symbolic environment. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures.
arXiv Detail & Related papers (2025-05-29T17:02:49Z)
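The summary above describes training transformers in a controlled symbolic environment built from atomic triples and multi-hop queries. A minimal sketch of such an environment, assuming a random functional knowledge base in which each (head, relation) pair has exactly one tail, is shown here; `build_kb`, `two_hop_queries`, and the entity and relation names are hypothetical and are not the authors' code.

```python
import random


def build_kb(num_entities: int, relations: list[str], seed: int = 0):
    """Random functional knowledge base: each (head, relation) maps to one tail."""
    rng = random.Random(seed)
    entities = [f"e{i}" for i in range(num_entities)]
    kb = {(h, r): rng.choice(entities) for h in entities for r in relations}
    return entities, kb


def atomic_triples(kb: dict) -> list[tuple[str, str, str]]:
    """One-hop facts (head, relation, tail)."""
    return [(h, r, t) for (h, r), t in kb.items()]


def two_hop_queries(entities, relations, kb):
    """Compose two atomic facts: (h, r1, r2) -> kb[(kb[(h, r1)], r2)]."""
    queries = []
    for h in entities:
        for r1 in relations:
            bridge = kb[(h, r1)]  # intermediate entity, never shown in the query
            for r2 in relations:
                queries.append(((h, r1, r2), kb[(bridge, r2)]))
    return queries


relations = ["r0", "r1", "r2"]
entities, kb = build_kb(num_entities=20, relations=relations, seed=0)
facts = atomic_triples(kb)
queries = two_hop_queries(entities, relations, kb)
print(len(facts), len(queries))  # 60 180
```

Holding out some (h, r1, r2) combinations from training would give a test of second-hop generalization in the spirit of the summary; the actual data splits and scale used by the authors are not stated here.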
Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs [6.190663515080656]
We present the first systematic study linking hallucination incidence to internal-state drift induced by context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question. We track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, Jensen-Shannon (JS), and Spearman drifts of hidden states and attention maps.
arXiv Detail & Related papers (2025-05-22T16:50:58Z)
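The summary above names cosine, entropy, JS (Jensen-Shannon), and Spearman drifts of hidden states and attention maps without defining them. A minimal sketch of the first three follows, assuming drift is measured between each titration round and the unperturbed round 0 and that each attention row is a probability distribution; the vectors are toy values and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_drift(h_ref: Sequence[float], h_cur: Sequence[float]) -> float:
    """1 - cosine similarity between a reference and a current hidden state."""
    dot = sum(a * b for a, b in zip(h_ref, h_cur))
    norm_ref = math.sqrt(sum(a * a for a in h_ref))
    norm_cur = math.sqrt(sum(b * b for b in h_cur))
    return 1.0 - dot / (norm_ref * norm_cur)


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (nats) of a probability distribution, e.g. one attention row."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint (at most ln 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical per-round values: round 0 is the clean prompt, later rounds add context.
hidden = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.2]]
attn = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]]
for t in range(1, len(hidden)):
    print(
        t,
        round(cosine_drift(hidden[0], hidden[t]), 4),
        round(entropy(attn[t]) - entropy(attn[0]), 4),  # entropy drift
        round(js_divergence(attn[0], attn[t]), 4),
    )
```

The midpoint distribution in the Jensen-Shannon divergence is positive wherever either input is, so, unlike raw KL, it stays finite even when one attention row has zeros the other lacks.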
Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations [82.42811602081692]
This paper introduces a subsequence association framework to systematically trace and understand hallucinations. The key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts.
arXiv Detail & Related papers (2025-04-17T06:34:45Z)
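The summary above describes tracing causal subsequences by analyzing hallucination probabilities across randomized input contexts. One way to sketch that idea is a Monte Carlo estimate: keep one candidate span fixed, resample every other token from a filler vocabulary, and average the score. This assumes a black-box `score` function returning the model's probability of the hallucinated answer for a token list; here it is a hypothetical toy rule, and the whole sketch is a simplification, not the authors' algorithm.

```python
import random
from typing import Callable, Sequence


def randomized_context(tokens, start, end, filler, rng):
    """Keep tokens[start:end] in place and resample every other position."""
    return [tok if start <= i < end else rng.choice(filler) for i, tok in enumerate(tokens)]


def trace_subsequences(
    tokens: Sequence[str],
    score: Callable[[list[str]], float],
    filler: Sequence[str],
    max_len: int = 2,
    samples: int = 20,
    seed: int = 0,
) -> list[tuple[float, tuple[str, ...]]]:
    """Rank contiguous spans by the mean hallucination score they induce on their own."""
    rng = random.Random(seed)
    results = []
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            end = start + n
            probs = [score(randomized_context(tokens, start, end, filler, rng)) for _ in range(samples)]
            results.append((sum(probs) / samples, tuple(tokens[start:end])))
    results.sort(key=lambda r: r[0], reverse=True)  # stable: ties keep generation order
    return results


# Hypothetical stand-in for a model: pretend "Australia" alone triggers a hallucinated answer.
def toy_score(context: list[str]) -> float:
    return 0.9 if "Australia" in context else 0.1


prompt = ["the", "capital", "of", "Australia", "is"]
for mean_prob, span in trace_subsequences(prompt, toy_score, filler=["a", "river", "blue", "city"])[:3]:
    print(round(mean_prob, 2), span)
```

With a real model, the score would come from the model's output probabilities, and `samples` trades compute for lower variance in each span's estimate.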
Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the interaction between world knowledge and logical reasoning. We find that state-of-the-art large language models (LLMs) often rely on superficial generalizations. We show that simple reformulations of the task can elicit more robust reasoning behavior.
arXiv Detail & Related papers (2024-10-31T12:48:58Z)
Exposing Attention Glitches with Flip-Flop Language Modeling [55.0688535574859]
This work identifies and analyzes the phenomenon of attention glitches in large language models. We introduce flip-flop language modeling (FFLM), a family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models. We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques.
arXiv Detail & Related papers (2023-06-01T17:44:35Z)
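Flip-flop language modeling uses synthetic sequences of instruction and data symbols. A minimal sketch, assuming the write/ignore/read formulation (`w`, `i`, `r`, each followed by a bit, where a read must reproduce the most recently written bit) and illustrative instruction probabilities that are not taken from the paper, could look like this; `generate_flip_flop` and `reads_are_correct` are hypothetical names.

```python
import random


def generate_flip_flop(num_pairs: int, p_write: float = 0.1, p_read: float = 0.1, seed=None) -> list[str]:
    """Sample (instruction, bit) pairs; a read's bit is the last written bit."""
    rng = random.Random(seed)
    last = str(rng.randint(0, 1))
    seq = ["w", last]  # start with a write so every read has a defined answer
    for _ in range(num_pairs - 1):
        u = rng.random()
        if u < p_write:
            op, bit = "w", str(rng.randint(0, 1))
            last = bit
        elif u < p_write + p_read:
            op, bit = "r", last  # the correct continuation after a read
        else:
            op, bit = "i", str(rng.randint(0, 1))  # ignored: must not change memory
        seq += [op, bit]
    return seq


def reads_are_correct(seq: list[str]) -> bool:
    """Check that every read returns the bit of the most recent write."""
    last = None
    for op, bit in zip(seq[0::2], seq[1::2]):
        if op == "w":
            last = bit
        elif op == "r" and bit != last:
            return False
    return True


example = generate_flip_flop(num_pairs=12, seed=3)
print(" ".join(example))
print(reads_are_correct(example))  # True by construction
```

Lowering `p_write` and `p_read` yields long runs of ignore instructions between a write and its read, one plausible knob for producing the long-range dependencies on which a model's sporadic read errors can be measured.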
Nested Counterfactual Identification from Arbitrary Surrogate Experiments [95.48089725859298]
We study the identification of nested counterfactuals from an arbitrary combination of observations and experiments. Specifically, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones.
arXiv Detail & Related papers (2021-07-07T12:51:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.