DRIFT: Detecting Representational Inconsistencies for Factual Truthfulness
- URL: http://arxiv.org/abs/2601.14210v2
- Date: Thu, 29 Jan 2026 15:25:01 GMT
- Title: DRIFT: Detecting Representational Inconsistencies for Factual Truthfulness
- Authors: Rohan Bhatnagar, Youran Sun, Chi Andrew Zhang, Yixin Wen, Haizhao Yang
- Abstract summary: LLMs often produce fluent but incorrect answers, yet detecting such hallucinations typically requires multiple sampling passes or post-hoc verification. We propose a lightweight probe to read these signals directly from hidden states. We develop an LLM router that answers confident queries immediately while delegating uncertain ones to stronger models.
- Score: 5.785021425715989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLMs often produce fluent but incorrect answers, yet detecting such hallucinations typically requires multiple sampling passes or post-hoc verification, adding significant latency and cost. We hypothesize that intermediate layers encode confidence signals that are lost in the final output layer, and propose a lightweight probe to read these signals directly from hidden states. The probe adds less than 0.1% computational overhead and can run fully in parallel with generation, enabling hallucination detection before the answer is produced. Building on this, we develop an LLM router that answers confident queries immediately while delegating uncertain ones to stronger models. Despite its simplicity, our method achieves SOTA AUROC on 10 out of 12 settings across four QA benchmarks and three LLM families, with gains of up to 13 points over prior methods, and generalizes across dataset shifts without retraining.
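The mechanism the abstract describes, a lightweight probe over intermediate hidden states feeding a confidence router, can be sketched in a few lines. Everything below is a toy illustration, not the paper's implementation: the hidden states are synthetic, and `synthetic_state`, `train_probe`, `route`, and the routing threshold are all assumptions.

```python
import math
import random

random.seed(0)
DIM = 16  # toy hidden-state width; real models use thousands of dimensions

def synthetic_state(correct: bool):
    """Stand-in for an intermediate-layer hidden state. Following the
    paper's hypothesis, states for correct answers carry a confidence
    signal; here it is simulated as a shift along axis 0."""
    h = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    h[0] += 2.0 if correct else -2.0
    return h

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, epochs=50, lr=0.1):
    """Logistic-regression probe: sigmoid(w.h + b) estimates P(correct)."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for h, y in data:
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            g = p - y  # gradient of the log-loss
            w = [wi - lr * g * hi for wi, hi in zip(w, h)]
            b -= lr * g
    return w, b

def route(h, w, b, threshold=0.5):
    """Router: answer locally when the probe is confident, else escalate."""
    p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
    return "answer_locally" if p >= threshold else "delegate"

train = [(synthetic_state(c), 1.0 if c else 0.0) for c in [True, False] * 200]
w, b = train_probe(train)

eval_set = [(synthetic_state(c), c) for c in [True, False] * 100]
accuracy = sum((route(h, w, b) == "answer_locally") == y
               for h, y in eval_set) / len(eval_set)
print(f"probe routing accuracy on synthetic states: {accuracy:.2f}")
```

Because the probe only reads hidden states that are already computed during the forward pass, this kind of detector can run alongside generation, which is where the near-zero overhead claim comes from.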
Related papers
- Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models [58.946955321428845]
This work presents self-rewarding sequential Monte Carlo (SMC). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy. We introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights.
arXiv Detail & Related papers (2026-02-02T09:21:45Z)
- EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients [6.736735746633275]
Diffusion-based large language models (dLLMs) refine token generations through iterative denoising, but answers often stabilize before all steps complete. We propose EDIT, an inference-time criterion that adaptively stops denoising once sufficient reasoning stability relative to training-time reasoning is detected.
arXiv Detail & Related papers (2025-11-29T23:47:47Z)
- Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes [79.36545159724703]
We propose Latent Representation Probing (LRP) to train lightweight probes on hidden states or attention patterns. LRP improves abstention accuracy by 7.6% over the best baselines. This establishes a principled framework for building deployment-ready AI systems.
arXiv Detail & Related papers (2025-11-25T00:24:42Z)
- SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs [43.76748192880071]
This paper presents a principled UQ framework that quantifies the inherent semantic uncertainty of large language models. We develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies. We then exploit latent semantic structural information through hierarchical abstraction.
arXiv Detail & Related papers (2025-11-20T11:54:12Z)
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
- Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs [23.900061215331338]
We show that question ambiguity is linearly encoded in the internal representations of large language models (LLMs). We show that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.
arXiv Detail & Related papers (2025-09-17T03:34:35Z)
- Cross-Layer Attention Probing for Fine-Grained Hallucination Detection [6.83291363146574]
We propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection. Our empirical evaluations show that CLAP improves hallucination detection compared to baselines on both decoded responses and responses sampled at higher temperatures. CLAP maintains high reliability even when applied out-of-distribution.
arXiv Detail & Related papers (2025-09-04T14:37:34Z)
- Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches. Our work highlights and leverages an overlooked property of DLMs: early answer convergence. We introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding.
arXiv Detail & Related papers (2025-08-27T15:40:25Z)
- Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models [24.72990207218907]
Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation. We investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses.
arXiv Detail & Related papers (2025-08-11T16:12:36Z)
- MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them [52.764019220214344]
Hallucinations pose critical risks for large language model (LLM)-based agents. We present MIRAGE-Bench, the first unified benchmark for eliciting and evaluating hallucinations in interactive environments.
arXiv Detail & Related papers (2025-07-28T17:38:29Z)
- Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions [60.43398881149664]
We introduce LOS-Net, a lightweight attention-based architecture trained on an efficient encoding of the LLM Output Signature. It achieves superior performance across diverse benchmarks and LLMs, while maintaining extremely low detection latency.
arXiv Detail & Related papers (2025-03-18T09:04:37Z)
- LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation [52.58791563814837]
Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD). This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of large language models (LLMs). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain.
arXiv Detail & Related papers (2025-03-18T00:50:40Z)
- HalluCounter: Reference-free LLM Hallucination Detection in the Wild! [6.5037356041929675]
HalluCounter is a reference-free hallucination detection method that utilizes both response-response and query-response consistency and alignment patterns. Our method outperforms state-of-the-art approaches by a significant margin, achieving over 90% average confidence in hallucination detection across datasets.
arXiv Detail & Related papers (2025-03-06T16:59:18Z)
- Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Ambiguous Prompts and Unanswerable Questions [60.31496362993982]
Large language models (LLMs) frequently generate confident yet inaccurate responses. We present a novel, test-time approach to detecting model hallucination through systematic analysis of information flow.
arXiv Detail & Related papers (2024-12-13T16:14:49Z)
- Mitigating LLM Hallucinations via Conformal Abstention [70.83870602967625]
We develop a principled procedure for determining when a large language model should abstain from responding in a general domain.
We leverage conformal prediction techniques to develop an abstention procedure that benefits from rigorous theoretical guarantees on the hallucination rate (error rate).
Experimentally, our resulting conformal abstention method reliably bounds the hallucination rate on various closed-book, open-domain generative question answering datasets.
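The calibration step behind such an abstention procedure can be sketched as follows. This is a toy illustration under assumptions of my own: a scalar self-confidence score per question, synthetic grading via `simulated_qa`, and a split-conformal-style search for the smallest answering threshold whose corrected calibration error stays under the target rate; it is not the paper's exact procedure.

```python
import random

random.seed(1)
ALPHA = 0.1  # target bound on the hallucination (error) rate

def simulated_qa():
    """(confidence, was_correct) pair; a stand-in for running an LLM on a
    question and grading its answer. Correct answers tend to score higher."""
    correct = random.random() < 0.7
    conf = random.gauss(0.8 if correct else 0.3, 0.15)
    return conf, correct

def calibrate_threshold(calib, alpha):
    """Smallest threshold lam such that the (+1-corrected) calibration
    error among answered questions is at most alpha. Abstaining never
    counts as an error, so raising lam can only reduce the error count."""
    n = len(calib)
    for lam in sorted({c for c, _ in calib}):
        errors = sum(1 for c, ok in calib if c >= lam and not ok)
        if (errors + 1) / (n + 1) <= alpha:
            return lam
    return float("inf")  # abstain on everything

calib = [simulated_qa() for _ in range(500)]
lam = calibrate_threshold(calib, ALPHA)

fresh = [simulated_qa() for _ in range(1000)]
halluc_rate = sum(1 for c, ok in fresh if c >= lam and not ok) / len(fresh)
print(f"threshold={lam:.3f}  hallucination rate={halluc_rate:.3f}")
```

The +1 correction in the calibration bound is what turns an empirical error count into a guarantee on fresh, exchangeable data; the measured rate on the held-out set should land at or below the target.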
arXiv Detail & Related papers (2024-04-04T11:32:03Z)
- INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection [39.52923659121416]
We propose to explore the dense semantic information retained within INternal States for hallucInation DEtection (INSIDE).
A simple yet effective EigenScore metric is proposed to better evaluate responses' self-consistency.
A test time feature clipping approach is explored to truncate extreme activations in the internal states.
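The self-consistency intuition behind an EigenScore-style metric can be illustrated in miniature. The 3-dimensional embeddings, the regularization constant, and the normalization below are assumptions for the sketch; the paper works with real sentence embeddings over K sampled responses. The idea: the log-determinant of the regularized covariance of response embeddings is small when responses agree and large when they diverge.

```python
import math
import random

random.seed(2)

def covariance(embs):
    """d x d covariance of K embedding row-vectors."""
    K, d = len(embs), len(embs[0])
    mean = [sum(e[j] for e in embs) / K for j in range(d)]
    return [[sum((e[i] - mean[i]) * (e[j] - mean[j]) for e in embs) / K
             for j in range(d)] for i in range(d)]

def det3(m):
    """Determinant of a 3x3 matrix, expanded by the first row
    (kept tiny so no linear-algebra library is needed)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def eigenscore(embs, reg=1e-3):
    """log det of the regularized covariance, averaged over dimensions.
    High score = low self-consistency = likely hallucination."""
    C = covariance(embs)
    for i in range(len(C)):
        C[i][i] += reg  # regularizer keeps the determinant positive
    return math.log(det3(C)) / len(C)

K = 8
base = [0.5, -0.2, 0.9]  # embeddings of K near-identical responses
consistent = [[b + random.gauss(0, 0.01) for b in base] for _ in range(K)]
divergent = [[random.gauss(0, 1.0) for _ in range(3)] for _ in range(K)]

print(f"consistent: {eigenscore(consistent):.2f}  "
      f"divergent: {eigenscore(divergent):.2f}")
```

A determinant close to zero means the sampled responses all lie near one point in embedding space, which is exactly the self-consistency signal the metric is after.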
arXiv Detail & Related papers (2024-02-06T06:23:12Z)
- Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
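The token-level idea can be illustrated with a deliberately tiny stand-in language model: a smoothed unigram model over a toy corpus, rather than the LLM next-token probabilities and contextual information the paper actually uses. The corpus, the `flag_tokens` helper, and the 4-bit threshold are all assumptions of the sketch. Tokens whose surprisal is far above normal are flagged as potentially adversarial.

```python
import math
from collections import Counter

# Toy smoothed unigram LM standing in for an LLM's token probabilities.
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)
total = sum(counts.values())
vocab = len(counts)

def surprisal(token):
    """-log2 p(token), with add-one smoothing so unseen (possibly
    adversarial) tokens get a finite but large surprisal."""
    p = (counts[token] + 1) / (total + vocab + 1)
    return -math.log2(p)

def flag_tokens(prompt, threshold=4.0):
    """Return the tokens whose surprisal exceeds the (assumed) threshold."""
    return [t for t in prompt.split() if surprisal(t) > threshold]

print(flag_tokens("the cat sat on the xqzt"))  # only the gibberish token
```

Scoring per token, rather than per prompt, is what lets the detector localize which span of an input looks adversarial instead of merely rejecting the whole prompt.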
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
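The ensembling step lends itself to a standard entropy decomposition; the two-way answer distributions below are made-up numbers, and the split shown is the usual total = within + disagreement identity rather than the paper's exact formulation. The entropy of the averaged prediction splits into the mean per-clarification entropy (uncertainty that survives clarification) plus the disagreement across clarifications (uncertainty attributable to input ambiguity).

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Hypothetical predictive distributions over two answers, one per
# clarified rewrite of a single ambiguous question.
clarified_preds = [
    [0.9, 0.1],  # under clarification A the model favours answer 1
    [0.1, 0.9],  # under clarification B it favours answer 2
]

n = len(clarified_preds)
mean_pred = [sum(p[i] for p in clarified_preds) / n for i in range(2)]

total_u = entropy(mean_pred)                             # ensemble entropy
within_u = sum(entropy(p) for p in clarified_preds) / n  # survives clarification
ambiguity_u = total_u - within_u                         # disagreement term

print(f"total={total_u:.3f} within={within_u:.3f} ambiguity={ambiguity_u:.3f}")
```

Here the clarifications disagree sharply, so most of the total uncertainty is attributed to the ambiguity of the input rather than to the model's uncertainty about any single clarified question.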
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.