A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations
- URL: http://arxiv.org/abs/2507.23221v1
- Date: Thu, 31 Jul 2025 03:26:57 GMT
- Title: A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations
- Authors: Charles O'Neill, Slava Chalnev, Chi Chi Zhao, Max Kirkby, Mudith Jayasekara,
- Abstract summary: A generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contextual hallucinations -- statements unsupported by given context -- remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, proving its actionability. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2000-example ContraTales benchmark for realistic assessment of such solutions.
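The abstract describes three concrete steps: fitting a linear probe on the observer's residual stream, attributing the probe's signal with gradient-times-activation, and steering generation along the learned direction. The sketch below illustrates the probe and the steering step only. It is not the authors' released code: the observer checkpoint, layer index, mean pooling over completion tokens, and steering coefficient are all illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): fit a linear probe on an
# observer model's residual stream to separate hallucinated from faithful
# completions, then reuse the probe weights as a steering direction.
# Layer index, mean pooling, and the steering coefficient are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

OBSERVER = "google/gemma-2-2b"   # observer model; the generator may differ
LAYER = 13                       # assumed mid-layer; sweep this in practice

tok = AutoTokenizer.from_pretrained(OBSERVER)
model = AutoModelForCausalLM.from_pretrained(OBSERVER, torch_dtype=torch.float32)
model.eval()

@torch.no_grad()
def residual_features(context: str, completion: str) -> torch.Tensor:
    """Mean-pooled residual-stream activations over the completion tokens."""
    ids = tok(context + completion, return_tensors="pt")
    n_ctx = len(tok(context)["input_ids"])
    out = model(**ids, output_hidden_states=True)
    resid = out.hidden_states[LAYER][0]      # (seq_len, d_model)
    return resid[n_ctx:].mean(dim=0)         # pool over completion tokens only

# Labelled pairs: (context, completion, 1 = hallucinated, 0 = faithful).
examples = [
    ("Anna owns two cats.", " Anna has exactly two pets.", 0),
    ("Anna owns two cats.", " Anna has five dogs.", 1),
]
X = torch.stack([residual_features(c, s) for c, s, _ in examples]).numpy()
y = [label for _, _, label in examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = torch.tensor(probe.coef_[0], dtype=torch.float32)
direction = direction / direction.norm()     # unit "hallucination" direction

# Steering sketch: subtract alpha * direction from the residual stream at the
# same layer via a forward hook (flip the sign to induce rather than suppress).
alpha = 4.0
def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - alpha * direction.to(hidden)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
# ... call model.generate(...) here while the hook is active ...
handle.remove()
```

In practice the probe would be fit on a full labelled set such as the ContraTales benchmark, with the layer and steering strength swept rather than fixed as above.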
Related papers
- Hallucination Begins Where Saliency Drops [18.189047289404325]
Hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token. We introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token. Our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution.
arXiv Detail & Related papers (2026-01-28T05:50:52Z) - A novel hallucination classification framework [0.0]
This work introduces a novel methodology for the automatic detection of hallucinations generated during large language model (LLM) inference. The proposed approach is based on a systematic taxonomy and controlled reproduction of diverse hallucination types through prompt engineering.
arXiv Detail & Related papers (2025-10-06T09:54:20Z) - Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations [73.37711261605271]
Hallucination mitigation methods are mainly based on preference alignment and require external human annotations or auxiliary models for preference data collection. We propose Autonomous Preference Alignment via Self-Injection (APASI), a novel and generalizable method that mitigates hallucinations without external dependencies. APASI leverages the target LVLM to self-inject hallucinations into a generated response, creating a pair of responses with varying preference levels.
arXiv Detail & Related papers (2025-09-14T14:26:53Z) - SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs [52.03164192840023]
Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge. We propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We construct SHALE, a benchmark designed to assess both faithfulness and factuality hallucinations.
arXiv Detail & Related papers (2025-08-13T07:58:01Z) - Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models [0.0]
We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in large language models. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations.
arXiv Detail & Related papers (2025-08-03T17:29:48Z) - ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs [50.18087419133284]
Hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations. We introduce a novel metric, the ICR Score, which quantifies the contribution of modules to the hidden states' update. We propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states.
arXiv Detail & Related papers (2025-07-22T11:44:26Z) - HIDE and Seek: Detecting Hallucinations in Language Models via Decoupled Representations [17.673293240849787]
Contemporary Language Models (LMs) often generate content that is factually incorrect or unfaithful to the input context. We propose a single-pass, training-free approach for effective Hallucination detectIon via Decoupled rEpresentations (HIDE). Our results demonstrate that HIDE outperforms other single-pass methods in almost all settings.
arXiv Detail & Related papers (2025-06-21T16:02:49Z) - Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs [47.18623962083962]
We present a novel approach for detecting hallucinations in large language models. We find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses. We propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores.
arXiv Detail & Related papers (2025-06-11T15:59:15Z) - Shaking to Reveal: Perturbation-Based Detection of LLM Hallucinations [25.18901449626428]
A widely adopted strategy to detect hallucination, known as self-assessment, relies on the model's own output confidence to estimate the factual accuracy of its answers. We propose Sample-Specific Prompting (SSP), a new framework that improves self-assessment by analyzing perturbation sensitivity at intermediate representations. SSP significantly outperforms prior methods across a range of hallucination detection benchmarks.
arXiv Detail & Related papers (2025-06-03T09:44:28Z) - MIRAGE: Assessing Hallucination in Multimodal Reasoning Chains of MLLM [58.2298313720146]
Multimodal hallucinations are multi-sourced and arise from diverse causes. Existing benchmarks fail to adequately distinguish between perception-induced hallucinations and reasoning-induced hallucinations.
arXiv Detail & Related papers (2025-05-30T05:54:36Z) - HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination". This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z) - Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z) - Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations [82.42811602081692]
This paper introduces a subsequence association framework to systematically trace and understand hallucinations. The key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts.
arXiv Detail & Related papers (2025-04-17T06:34:45Z) - Hallucination Detection in LLMs with Topological Divergence on Attention Graphs [64.74977204942199]
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models. We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting.
arXiv Detail & Related papers (2025-04-14T10:06:27Z) - Robust Hallucination Detection in LLMs via Adaptive Token Selection [25.21763722332831]
Hallucinations in large language models (LLMs) pose significant safety concerns that impede their broader deployment. We propose HaMI, a novel approach that enables robust detection of hallucinations through adaptive selection and learning of critical tokens. We achieve this robustness through an innovative formulation of hallucination detection as Multiple Instance learning (HaMI) over token-level representations within a sequence.
arXiv Detail & Related papers (2025-04-10T15:39:10Z) - Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information.
We propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations (see the sketch after this list).
arXiv Detail & Related papers (2023-12-25T12:32:49Z) - A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
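The ICD entry above describes an induce-then-contrast decoding strategy. As a rough illustration of the general contrastive idea, rather than the paper's exact recipe, the sketch below contrasts a base model's next-token log-probabilities against those of a hypothetical hallucination-prone copy; the stand-in checkpoints and the contrast weight ALPHA are assumptions.

```python
# Rough illustration of induce-then-contrast decoding: penalise tokens that a
# hallucination-prone model favours. The checkpoints and ALPHA are stand-ins,
# not the configuration used in the ICD paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"       # stand-in base model
INDUCED = "gpt2"    # stand-in for a deliberately hallucination-prone copy
ALPHA = 1.0         # contrast strength (assumed)

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
induced = AutoModelForCausalLM.from_pretrained(INDUCED).eval()

@torch.no_grad()
def icd_next_token(prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    log_p_base = base(ids).logits[0, -1].log_softmax(-1)
    log_p_induced = induced(ids).logits[0, -1].log_softmax(-1)
    # Amplify the base distribution and subtract the induced one.
    contrast = (1 + ALPHA) * log_p_base - ALPHA * log_p_induced
    return tok.decode(int(contrast.argmax()))

print(icd_next_token("The capital of Australia is"))
```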