Unsupervised Hallucination Detection by Inspecting Reasoning Processes
- URL: http://arxiv.org/abs/2509.10004v1
- Date: Fri, 12 Sep 2025 06:58:17 GMT
- Title: Unsupervised Hallucination Detection by Inspecting Reasoning Processes
- Authors: Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu
- Abstract summary: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. We propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. Our approach is fully unsupervised, computationally inexpensive, and works well even with little training data, making it suitable for real-time detection.
- Score: 53.15199932086543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement and obtains its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is treated as a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally inexpensive, and works well even with little training data, making it suitable for real-time detection.
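A rough sketch of this recipe (verification prompt, contextualized embedding as features, the model's own confidence as a soft pseudolabel) is given below. The gpt2 backbone, the last-token hidden state as the feature, the " true"/" false" confidence as the pseudolabel, and the linear regressor used as a stand-in probe are all illustrative assumptions, not the paper's exact design:

```python
# Sketch of an IRIS-style unsupervised detector. Assumptions (not the paper's exact design):
# gpt2 as the backbone, last-layer last-token hidden state as the feature, and the
# relative next-token probability of " true" vs " false" as the soft pseudolabel.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import Ridge

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def verify_features(statement: str):
    """Prompt the LM to verify a statement; return (embedding, soft pseudolabel)."""
    prompt = f"Is the following statement true or false?\nStatement: {statement}\nAnswer:"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids, output_hidden_states=True)
    emb = out.hidden_states[-1][0, -1].numpy()            # contextualized embedding
    logits = out.logits[0, -1]
    t_id = tok(" true", add_special_tokens=False).input_ids[0]
    f_id = tok(" false", add_special_tokens=False).input_ids[0]
    p_true = torch.softmax(logits[[t_id, f_id]], dim=-1)[0].item()
    return emb, p_true

# Train a lightweight probe on unlabeled statements, using the model's own
# confidence as a soft pseudolabel (a linear regressor is an illustrative stand-in).
statements = ["The Eiffel Tower is in Paris.", "The Moon is made of cheese.",
              "Water freezes at 0 degrees Celsius.", "Sharks are mammals."]
X, y_soft = zip(*(verify_features(s) for s in statements))
probe = Ridge().fit(list(X), list(y_soft))
print(probe.predict([verify_features("The Nile is the longest river in Africa.")[0]]))
```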
Related papers
- Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping [31.704726867711955]
We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations. ARS generates counterfactual answers through small latent interventions. ARS consistently improves detection and achieves substantial gains over strong baselines.
arXiv Detail & Related papers (2026-01-24T13:47:51Z) - Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning [11.28752240109815]
Large language models continually evolve through pre-training on ever-expanding web data. This adaptive process also exposes them to subtle forms of misinformation. We investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representation away from the truth.
arXiv Detail & Related papers (2025-10-29T14:35:03Z) - Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models [0.0]
We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in large language models. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations.
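A minimal sketch of the sensitivity idea described in this summary; the hand-written counterfactual and the log-likelihood gap used as the sensitivity score are illustrative assumptions, not the paper's procedure:

```python
# Illustration of counterfactual sensitivity scoring (assumed scoring rule:
# gap in average token log-likelihood between a statement and a subtly altered variant).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def avg_loglik(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss        # mean negative log-likelihood per token
    return -loss.item()

original = "The Great Wall of China was built over many centuries."
counterfactual = "The Great Wall of China was built in a single decade."  # subtle factual error

# A large gap suggests the model is sensitive to the perturbation (grounded knowledge);
# a small gap suggests the claim may not be anchored in factual knowledge.
sensitivity = avg_loglik(original) - avg_loglik(counterfactual)
print(f"sensitivity score: {sensitivity:.3f}")
```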
arXiv Detail & Related papers (2025-08-03T17:29:48Z) - When Truthful Representations Flip Under Deceptive Instructions? [24.004146630216685]
Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses. Deceptive instructions alter the internal representations of LLMs compared to truthful ones. Our analysis pinpoints layer-wise and feature-level correlates of instructed dishonesty.
arXiv Detail & Related papers (2025-07-29T18:27:13Z) - Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs [129.79394562739705]
Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". We propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results.
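A loose sketch of attention-based uncertainty in this spirit (not RAUQ's actual recurrence); the chosen layer and head and the combination rule below are assumptions:

```python
# Loose illustration of attention-based uncertainty, NOT RAUQ's actual recurrence.
# Assumptions: one fixed head, and score = token negative log-probability
# down-weighted by that head's attention to the immediately preceding token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def sequence_uncertainty(text: str, layer: int = 6, head: int = 4) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, output_attentions=True)
    logp = torch.log_softmax(out.logits[0, :-1], dim=-1)        # predicts tokens 1..T-1
    tok_logp = logp.gather(1, ids[0, 1:, None]).squeeze(1)      # log p(x_t | x_<t)
    attn_prev = out.attentions[layer][0, head].diagonal(-1)     # attention of token t to t-1
    scores = -tok_logp * (1.0 - attn_prev)                      # illustrative combination rule
    return scores.mean().item()                                 # higher = more suspect

print(sequence_uncertainty("Paris is the capital of France."))
```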
arXiv Detail & Related papers (2025-05-26T14:28:37Z) - Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs [13.982395477368396]
Large language models (LLMs) have demonstrated remarkable performance across diverse tasks by encoding vast amounts of factual knowledge. They are still prone to hallucinations, generating incorrect or misleading information, often accompanied by high uncertainty. We introduce Semantic Volume, a novel measure for quantifying both external and internal uncertainty in LLMs.
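One way such a dispersion-style measure could look in code; treating the volume as the log-determinant of the Gram matrix of sampled-response embeddings, and the embedding model itself, are assumptions rather than the paper's exact definition:

```python
# Toy sketch of a "volume" dispersion measure over sampled responses (assumptions:
# sentence-transformers embeddings, log-determinant of the regularized Gram matrix).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_volume(responses: list[str], eps: float = 1e-3) -> float:
    E = encoder.encode(responses, normalize_embeddings=True)   # (n, d)
    gram = E @ E.T + eps * np.eye(len(responses))              # regularize for stability
    sign, logdet = np.linalg.slogdet(gram)
    return logdet   # larger value = more dispersed (less consistent) responses

consistent = ["Paris is the capital of France."] * 5
divergent = ["Paris", "Lyon is the capital", "France has no capital",
             "The capital is Marseille", "Nice is the capital of France"]
print(semantic_volume(consistent), semantic_volume(divergent))
```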
arXiv Detail & Related papers (2025-02-28T17:09:08Z) - Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification [116.77055746066375]
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output.
We propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification.
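A simplified sketch of claim-level scoring from token-level uncertainty; splitting the answer into sentence-level claims and scoring each by its mean token log-probability are assumptions, not the paper's exact pipeline:

```python
# Simplified token-level uncertainty fact-check (assumptions: each sentence is one claim,
# claim score = mean token log-probability of the claim conditioned on the question).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def claim_scores(question: str, answer: str) -> list[tuple[str, float]]:
    scores = []
    for claim in (s.strip() for s in answer.split(".") if s.strip()):
        ctx_ids = tok(question, return_tensors="pt").input_ids
        claim_ids = tok(" " + claim + ".", return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, claim_ids], dim=1)
        with torch.no_grad():
            logp = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
        # Average log-probability of the claim's tokens given everything before them.
        claim_logp = logp[ctx_ids.shape[1] - 1:].gather(1, ids[0, ctx_ids.shape[1]:, None])
        scores.append((claim, claim_logp.mean().item()))
    return scores   # low scores flag claims the model itself is unsure about

for claim, s in claim_scores("Who wrote Hamlet?",
                             "Hamlet was written by William Shakespeare. It was written in 1402."):
    print(f"{s:7.3f}  {claim}")
```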
arXiv Detail & Related papers (2024-03-07T17:44:17Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from a tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z) - A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
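A rough zero-resource self-check in the spirit of reverse validation; the prompts and the text-generation pipeline below are illustrative assumptions, not the paper's exact method:

```python
# Rough zero-resource self-check sketch. Assumptions: a generic text-generation
# pipeline stands in for the target LLM, and reverse validation is approximated as
# "derive a question, answer it independently, and check agreement with the passage".
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

def ask(prompt: str) -> str:
    return generate(prompt, max_new_tokens=40, return_full_text=False)[0]["generated_text"].strip()

def reverse_validate(passage: str) -> bool:
    # 1. Derive a question whose answer should be the passage's main claim.
    question = ask(f"Write one question whose answer is the main claim of:\n{passage}\nQuestion:")
    # 2. Answer that question independently, without showing the passage.
    independent_answer = ask(f"Question: {question}\nAnswer:")
    # 3. Check whether the independent answer agrees with the passage.
    verdict = ask(f"Passage: {passage}\nAnswer: {independent_answer}\n"
                  "Does the answer support the passage? Reply yes or no:")
    return verdict.lower().startswith("yes")

print(reverse_validate("Alan Turing was born in London in 1912."))
```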
arXiv Detail & Related papers (2023-10-10T10:14:59Z) - Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding [10.315867984674032]
We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty.
We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them.
Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice.
arXiv Detail & Related papers (2023-06-01T21:27:22Z) - RATT: Leveraging Unlabeled Data to Guarantee Generalization [96.08979093738024]
We introduce a method that leverages unlabeled data to produce generalization bounds.
We prove that our bound is valid for 0-1 empirical risk minimization.
This work provides practitioners with an option for certifying the generalization of deep nets even when unseen labeled data is unavailable.
arXiv Detail & Related papers (2021-05-01T17:05:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.