The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems
- URL: http://arxiv.org/abs/2512.15068v2
- Date: Thu, 18 Dec 2025 21:43:59 GMT
- Title: The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems
- Authors: Debu Sinha,
- Abstract summary: We apply hallucination prediction to RAG detection, transforming scores into decision sets with finite-sample coverage guarantees.<n>We analyze this failure through the lens of distributional tails, showing that while NLI models achieve acceptable AUC (0.81), the "hardest" hallucinations are semantically indistinguishable from faithful responses.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. While current detection methods leverage embedding similarity and natural language inference (NLI), their reliability in safety-critical settings remains unproven. We apply conformal prediction to RAG hallucination detection, transforming heuristic scores into decision sets with finite-sample coverage guarantees (1-alpha). Using calibration sets of n=600, we demonstrate a fundamental dichotomy: on synthetic hallucinations (Natural Questions), embedding methods achieve 95% coverage with 0% False Positive Rate (FPR). However, on real hallucinations from RLHF-aligned models (HaluEval), the same methods fail catastrophically, yielding 100% FPR at target coverage. We analyze this failure through the lens of distributional tails, showing that while NLI models achieve acceptable AUC (0.81), the "hardest" hallucinations are semantically indistinguishable from faithful responses, forcing conformal thresholds to reject nearly all valid outputs. Crucially, GPT-4 as a judge achieves 7% FPR (95% CI:[3.4%, 13.7%]) on the same data, proving the task is solvable via reasoning but opaque to surface-level semantics--a phenomenon we term the "Semantic Illusion."
Related papers
- Spectral Guardrails for Agents in the Wild: Detecting Tool Use Hallucinations via Attention Topology [0.0]
We propose a training free guardrail based on spectral analysis of attention topology that complements supervised approaches.<n>On Llama 3.1 8B, our method achieves 97.7% recall with multi-feature detection and 86.1% recall with 81.0% precision for balanced deployment.<n>We reveal the Loud Liar'' phenomenon: Llama 3.1 8B's failures are spectrally catastrophic and dramatically easier to detect, while Mistral 7B achieves the best discrimination.
arXiv Detail & Related papers (2026-02-08T18:56:16Z) - Detecting AI Hallucinations in Finance: An Information-Theoretic Method Cuts Hallucination Rate by 92% [4.693270291878929]
Large language models (LLMs) produce fluent but unsupported answers - hallucinations.<n>We propose ECLIPSE, a framework that treats hallucination as a mismatch between a model's semantic entropy and the capacity of available evidence.
arXiv Detail & Related papers (2025-12-02T05:25:48Z) - HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems in the Legal Domain [28.691566712713808]
Large Language Models (LLMs) are widely used in industry but remain prone to hallucinations, limiting their reliability in critical applications.<n>This work addresses hallucination reduction in consumer grievance chatbots built using LLaMA 3.1 8B Instruct, a compact model frequently used in industry.<n>We develop HalluDetect, an LLM-based hallucination detection system that achieves an F1 score of 68.92% outperforming baseline detectors by 22.47%.
arXiv Detail & Related papers (2025-09-15T06:23:36Z) - Unsupervised Hallucination Detection by Inspecting Reasoning Processes [53.15199932086543]
Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data.<n>We propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness.<n>Our approach is fully unsupervised, computationally low cost, and works well even with few training data, making it suitable for real-time detection.
arXiv Detail & Related papers (2025-09-12T06:58:17Z) - Semantic Energy: Detecting LLM Hallucination Beyond Entropy [106.92072182161712]
Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations.<n>Uncertainty estimation is a feasible approach to detect such hallucinations.<n>We introduce Semantic Energy, a novel uncertainty estimation framework.
arXiv Detail & Related papers (2025-08-20T07:33:50Z) - Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models [0.0]
We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in large language models.<n>Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations.
arXiv Detail & Related papers (2025-08-03T17:29:48Z) - ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs [50.18087419133284]
hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations.<n>We introduce a novel metric, the ICR Score, which quantifies the contribution of modules to the hidden states' update.<n>We propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states.
arXiv Detail & Related papers (2025-07-22T11:44:26Z) - RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection [26.186204911845866]
hallucinations remain a vital obstacle to large language models' trustworthy use.<n>We propose RePPL to recalibrate uncertainty measurement by these two aspects.<n>Our method achieves the best comprehensive detection performance across various QA datasets.
arXiv Detail & Related papers (2025-05-21T11:23:05Z) - Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [78.78822033285938]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations.<n>In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z) - Detecting Hallucination and Coverage Errors in Retrieval Augmented Generation for Controversial Topics [16.874364446070967]
We explore a strategy to handle controversial topics in LLM-based chatbots based on Wikipedia's Neutral Point of View (NPOV) principle.
We use a deterministic retrieval system and then focus on common LLM failure modes that arise during this approach to text generation, namely hallucination and coverage errors.
We show that our methods still yield good results on hallucination (84.0%) and coverage error (85.2%) detection.
arXiv Detail & Related papers (2024-03-13T18:47:00Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - A New Benchmark and Reverse Validation Method for Passage-level
Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z) - A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of
LLMs by Validating Low-Confidence Generation [76.34411067299331]
Large language models often tend to 'hallucinate' which critically hampers their reliability.
We propose an approach that actively detects and mitigates hallucinations during the generation process.
We show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average.
arXiv Detail & Related papers (2023-07-08T14:25:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.