InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2510.21538v1
- Date: Fri, 24 Oct 2025 15:02:01 GMT
- Title: InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation
- Authors: Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu
- Abstract summary: Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.
- Score: 4.038581147264715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore, classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses, demonstrating the potential of proxy-model evaluation. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.
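The pipeline the abstract describes, per-layer/per-head external context scores, a parametric knowledge score for the FFN contribution, and a regression-based classifier on top, can be sketched in NumPy. Everything below is illustrative: the tensor shapes, the synthetic attention and FFN values, and the exact score definitions are assumptions standing in for the paper's actual Qwen3-0.6b instrumentation, not its released code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes; not Qwen3-0.6b's real configuration.
n_layers, n_heads, seq_len, d_model = 4, 8, 32, 64
n_ctx = 20  # positions 0..n_ctx-1 hold the retrieved context

# Synthetic attention rows for one generated token (stand-ins for
# attention patterns captured from the proxy model).
attn = rng.random((n_layers, n_heads, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)  # each row sums to 1

# External context score (ECS): attention mass the generated token
# places on retrieved-context positions, per layer and head.
ecs = attn[..., :n_ctx].sum(axis=-1)  # shape (n_layers, n_heads)

# Parametric knowledge score (PKS) proxy: size of the FFN update
# relative to the residual stream it is added to, per layer.
ffn_out = rng.normal(size=(n_layers, d_model))
resid = rng.normal(size=(n_layers, d_model))
pks = np.linalg.norm(ffn_out, axis=-1) / np.linalg.norm(resid, axis=-1)

# One feature vector per response: flattened ECS plus per-layer PKS.
features = np.concatenate([ecs.ravel(), pks])

# Regression-based hallucination classifier: plain logistic
# regression trained by gradient descent on synthetic labelled data.
X = rng.normal(size=(200, features.size))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)  # synthetic labels
w = np.zeros(features.size)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)
train_acc = ((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y).mean()
```

In a real setup, `attn`, `ffn_out`, and `resid` would be collected with forward hooks on the proxy model and responses would be labelled against the retrieved context; the classifier on top can stay this simple, which is what makes proxy-model evaluation cheap.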
Related papers
- LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals [7.61196995380844]
Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations in large language models (LLMs) by grounding responses in retrieved documents. Yet, RAG-based LLMs still hallucinate even when provided with correct and sufficient context. We propose LUMINA, a novel framework that detects hallucinations in RAG systems through context-knowledge signals.
arXiv Detail & Related papers (2025-09-26T04:57:46Z)
- MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them [52.764019220214344]
Hallucinations pose critical risks for large language model (LLM)-based agents. We present MIRAGE-Bench, the first unified benchmark for eliciting and evaluating hallucinations in interactive environments.
arXiv Detail & Related papers (2025-07-28T17:38:29Z)
- SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion [2.7064617166078087]
Empirical studies demonstrate that the disequilibrium between external contextual information and internal parametric knowledge constitutes a primary factor in hallucination generation. The recently proposed ReDeEP framework decouples these dual mechanisms. This paper introduces SEReDeEP, which enhances computational processes through semantic entropy captured via trained linear probes.
arXiv Detail & Related papers (2025-05-12T13:10:46Z)
- Osiris: A Lightweight Open-Source Hallucination Detection System [30.63248848082757]
Hallucinations prevent RAG systems from being deployed in production environments. We introduce a multi-hop QA dataset with induced hallucinations. We achieve better recall with a 7B model than GPT-4o on the RAGTruth hallucination detection benchmark.
arXiv Detail & Related papers (2025-05-07T22:45:59Z)
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [78.78822033285938]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
- Hallucination Detection in LLMs with Topological Divergence on Attention Graphs [60.83579255387347]
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models. We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting.
arXiv Detail & Related papers (2025-04-14T10:06:27Z)
- ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability [27.325766792146936]
RAG incorporates external knowledge to reduce hallucinations caused by insufficient parametric (internal) knowledge. Detecting such hallucinations requires disentangling how Large Language Models (LLMs) utilize external and parametric knowledge. We propose ReDeEP, a novel method that detects hallucinations by decoupling LLM's utilization of external context and parametric knowledge.
arXiv Detail & Related papers (2024-10-15T09:02:09Z)
- LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation [2.7301965149124032]
In this paper, we propose LRP4RAG, a method for detecting hallucinations in large language models (LLMs). To the best of our knowledge, this is the first time that LRP has been used for detecting RAG hallucinations.
arXiv Detail & Related papers (2024-08-28T04:44:43Z)
- Rowen: Adaptive Retrieval-Augmented Generation for Hallucination Mitigation in LLMs [88.75700174889538]
Hallucinations present a significant challenge for large language models (LLMs). The utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of LLMs. We present Rowen, a novel framework that enhances LLMs with an adaptive retrieval augmentation process tailored to address hallucinated outputs.
arXiv Detail & Related papers (2024-02-16T11:55:40Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
- Detecting Hallucinated Content in Conditional Neural Sequence Generation [165.68948078624499]
We propose a task to predict whether each token in the output sequence is hallucinated (i.e., not contained in the input).
We also introduce a method for learning to detect hallucinations using pretrained language models fine-tuned on synthetic data.
arXiv Detail & Related papers (2020-11-05T00:18:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.