Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
- URL: http://arxiv.org/abs/2601.17467v1
- Date: Sat, 24 Jan 2026 13:47:51 GMT
- Title: Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
- Authors: Jianxiong Zhang, Bing Guo, Yuming Jiang, Haobo Wang, Bo An, Xuefeng Du
- Abstract summary: We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations. ARS generates counterfactual answers through small latent interventions. ARS consistently improves detection and achieves substantial gains over strong baselines.
- Score: 31.704726867711955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large reasoning models (LRMs) often generate long, seemingly coherent reasoning traces yet still produce incorrect answers, making hallucination detection challenging. Although trajectories contain useful signals, directly using trace text or vanilla hidden states for detection is brittle: traces vary in form and detectors can overfit to superficial patterns rather than answer validity. We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations by explicitly encoding answer stability. ARS generates counterfactual answers through small latent interventions, specifically, perturbing the trace-boundary embedding, and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent instability indicative of hallucination risk. The shaped embeddings are plug-and-play with existing embedding-based detectors and require no human annotations during training. Experiments demonstrate that ARS consistently improves detection and achieves substantial gains over strong baselines.
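Below is a minimal, self-contained sketch of the idea the abstract describes: perturb the trace-boundary embedding, label each perturbation by whether the decoded answer still agrees with the original, and shape representations so answer-agreeing states cluster while answer-disagreeing ones separate. The toy readout, the pairwise loss, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the ARS idea from the abstract (not the authors' code).
# Assumptions: `answer_from_state` stands in for decoding an answer from a
# hidden state; a simple contrastive-style loss stands in for the shaping objective.
import numpy as np

rng = np.random.default_rng(0)

def answer_from_state(h, readout):
    """Toy stand-in for decoding an answer from the trace-boundary state."""
    return int(np.argmax(readout @ h))

def agreement_labels(h_boundary, readout, n_perturb=16, eps=0.05):
    """Perturb the trace-boundary embedding and label each perturbation by
    whether the resulting answer agrees with the original one."""
    original = answer_from_state(h_boundary, readout)
    perturbed, labels = [], []
    for _ in range(n_perturb):
        h_p = h_boundary + eps * rng.standard_normal(h_boundary.shape)
        perturbed.append(h_p)
        labels.append(answer_from_state(h_p, readout) == original)
    return np.stack(perturbed), np.array(labels)

def shaping_loss(W, states, labels, margin=1.0):
    """Pull answer-agreeing states together and push answer-disagreeing
    states apart in the shaped space."""
    z = states @ W.T
    loss = 0.0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            d = np.linalg.norm(z[i] - z[j])
            if labels[i] == labels[j]:
                loss += d ** 2                      # agreeing pair: attract
            else:
                loss += max(0.0, margin - d) ** 2   # disagreeing pair: repel
    return loss

# Toy usage: a random trace-boundary state, answer readout, and shaping matrix.
dim, n_answers = 32, 4
h = rng.standard_normal(dim)
readout = rng.standard_normal((n_answers, dim))
states, labels = agreement_labels(h, readout)
W = rng.standard_normal((16, dim))
print("pairwise shaping loss:", round(float(shaping_loss(W, states, labels)), 3))
```

In practice the shaped embeddings would then feed any existing embedding-based detector, which is why the abstract describes them as plug-and-play.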
Related papers
- Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models [26.89705770151822]
Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency. We investigate whether data coverage itself can serve as a detection signal.
arXiv Detail & Related papers (2025-11-22T06:59:55Z) - Unsupervised Hallucination Detection by Inspecting Reasoning Processes [53.15199932086543]
Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. We propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. Our approach is fully unsupervised, computationally low cost, and works well even with little training data, making it suitable for real-time detection.
arXiv Detail & Related papers (2025-09-12T06:58:17Z) - A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations [0.0]
A generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points.
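As a rough illustration of the observer-probe setup summarized above, the sketch below fits a single linear direction on residual-stream-like features and scores new activations by projecting onto it. The features are random stand-ins; the paper's actual observer model, layer choice, and training procedure are not reproduced here.

```python
# Hedged sketch: a linear probe on residual-stream activations that scores
# text for contextual hallucination. Features are synthetic stand-ins for
# activations pulled from an observer model's residual stream.
import numpy as np

rng = np.random.default_rng(1)
dim = 64

# Mock residual-stream features for faithful (y=0) and hallucinated (y=1) text.
faithful = rng.normal(0.0, 1.0, size=(200, dim))
halluc = rng.normal(0.0, 1.0, size=(200, dim)) + 0.8 * np.ones(dim) / np.sqrt(dim)
X = np.vstack([faithful, halluc])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Fit a single linear direction (least-squares probe) separating the classes.
Xb = np.hstack([X, np.ones((len(X), 1))])            # add a bias column
w, *_ = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)   # targets in {-1, +1}

def hallucination_score(features):
    """Project activations onto the probe direction; higher => more likely hallucinated."""
    return float(np.append(features, 1.0) @ w)

print("faithful example score:", round(hallucination_score(faithful[0]), 3))
print("hallucinated example score:", round(hallucination_score(halluc[0]), 3))
```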
arXiv Detail & Related papers (2025-07-31T03:26:57Z) - Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation [9.540386616651295]
Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning. Our study highlights an overlooked trade-off in the use of reasoning.
arXiv Detail & Related papers (2025-06-20T15:49:37Z) - Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs [47.18623962083962]
We present a novel approach for detecting hallucinations in large language models. We find that hallucinated responses exhibit smaller deviations from their prompts compared to grounded responses. We propose a model-intrinsic detection method that uses distributional distances as principled hallucination scores.
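A hedged sketch of the idea summarized above: score a response by the distributional distance between prompt-token and response-token embeddings, where smaller distances would indicate hallucination. A fixed Gaussian-kernel MMD stands in for the paper's trainable deep kernels, and the embeddings are synthetic placeholders rather than real attention-head features.

```python
# Rough sketch: distributional distance (MMD) between prompt and response
# embedding distributions as a hallucination score. Per the abstract above,
# hallucinated responses would sit unusually close to their prompts.
import numpy as np

rng = np.random.default_rng(2)

def gaussian_kernel(a, b, bandwidth=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd_score(prompt_emb, response_emb, bandwidth=1.0):
    """Squared MMD between the prompt and response embedding distributions."""
    kxx = gaussian_kernel(prompt_emb, prompt_emb, bandwidth).mean()
    kyy = gaussian_kernel(response_emb, response_emb, bandwidth).mean()
    kxy = gaussian_kernel(prompt_emb, response_emb, bandwidth).mean()
    return kxx + kyy - 2 * kxy

# Toy per-token embeddings standing in for attention-head features.
prompt = rng.standard_normal((20, 16))
grounded = prompt.mean(0) + 1.5 * rng.standard_normal((30, 16))      # deviates more
hallucinated = prompt.mean(0) + 0.3 * rng.standard_normal((30, 16))  # stays close

print("grounded MMD:", round(float(mmd_score(prompt, grounded)), 3))
print("hallucinated MMD:", round(float(mmd_score(prompt, hallucinated)), 3))
```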
arXiv Detail & Related papers (2025-06-11T15:59:15Z) - Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models [12.270274049887298]
Reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination. Existing hallucination detection methods focus primarily on answer-level uncertainty. We propose RACE, a novel framework specifically tailored for hallucination detection in LRMs.
arXiv Detail & Related papers (2025-06-05T09:54:04Z) - Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations [82.42811602081692]
This paper introduces a subsequence association framework to systematically trace and understand hallucinations. The key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts.
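A loose sketch of the tracing idea summarized above: estimate how much each input token's presence raises the hallucination rate across randomized sub-contexts, and flag the token with the largest effect. The `model_hallucinates` oracle and the token-level granularity are hypothetical simplifications of the paper's subsequence-level algorithm.

```python
# Illustrative sketch: trace which input unit is most associated with a
# hallucinated output by comparing hallucination rates with and without it
# across randomized contexts. `model_hallucinates` is a hypothetical stand-in.
import numpy as np

rng = np.random.default_rng(3)

def model_hallucinates(context_tokens):
    """Hypothetical oracle: in this toy setup, the token 'trigger'
    spuriously drives hallucination most of the time."""
    return "trigger" in context_tokens and rng.random() < 0.9

def trace_associations(tokens, n_samples=200, keep_prob=0.5):
    """Estimate P(hallucination | token present) - P(hallucination | token absent)
    from randomized sub-contexts."""
    counts = {t: [0, 0, 0, 0] for t in tokens}  # [present_h, present_ok, absent_h, absent_ok]
    for _ in range(n_samples):
        keep = [t for t in tokens if rng.random() < keep_prob]
        h = model_hallucinates(keep)
        for t in tokens:
            if t in keep:
                counts[t][0 if h else 1] += 1
            else:
                counts[t][2 if h else 3] += 1
    scores = {}
    for t, (ph, pn, ah, an) in counts.items():
        p_present = ph / max(ph + pn, 1)
        p_absent = ah / max(ah + an, 1)
        scores[t] = p_present - p_absent
    return scores

scores = trace_associations(["the", "capital", "of", "trigger", "question"])
print(max(scores, key=scores.get), "is most associated with hallucination")
```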
arXiv Detail & Related papers (2025-04-17T06:34:45Z) - Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps [48.58310785625051]
Large language models (LLMs) can hallucinate details and respond with unsubstantiated answers.
This paper describes a simple approach for detecting such contextual hallucinations.
arXiv Detail & Related papers (2024-07-09T17:44:34Z) - Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
They generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z) - Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks.
This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z)