Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection
- URL: http://arxiv.org/abs/2508.21228v1
- Date: Thu, 28 Aug 2025 21:39:53 GMT
- Title: Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection
- Authors: Weizhi Gao, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin
- Abstract summary: We propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Our method achieves up to a 3x speedup without sacrificing AUROC performance.
- Score: 17.792828844969033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
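As a rough illustration of the self-consistency setup the abstract describes, the sketch below scores agreement across several sampled answers while memoizing shared prefixes. The sampler, the memo, and the scoring rule are placeholder assumptions for illustration only, not the authors' DMP implementation.

```python
# Hypothetical sketch of self-consistency scoring with a shared-prefix memo,
# illustrating the redundancy the paper exploits; `fake_sample` and the memo
# are placeholders, not the authors' DMP implementation.
import random
from collections import Counter

prefix_memo = {}  # stand-in for reusable decoder state (e.g., a KV cache)

def decode_state(prefix):
    """Return (and cache) the expensive state for a token prefix."""
    if prefix not in prefix_memo:
        prefix_memo[prefix] = object()  # placeholder for a real forward pass
    return prefix_memo[prefix]

def fake_sample(prompt, seed, max_new=4):
    """Toy sampler: reuses cached prefix state, then appends sampled tokens."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new):
        decode_state(tuple(tokens))  # shared prefixes across samples hit the memo
        tokens.append(rng.choice(["Paris", "Paris", "Paris", "Lyon"]))
    return tokens[len(prompt):]

def consistency_score(prompt, n_samples=5):
    """Agreement of the first (exact-answer) token across samples; low
    agreement is treated as a hallucination signal."""
    answers = [fake_sample(prompt, seed=s)[0] for s in range(n_samples)]
    return Counter(answers).most_common(1)[0][1] / n_samples

print(consistency_score(("Q:", "capital", "of", "France", "?")))
```

In a real pipeline the memo would hold reusable decoder state such as KV caches, and the per-question scores would be evaluated with AUROC against labeled hallucinations.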
Related papers
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for diffusion LLMs (dLLMs).
arXiv Detail & Related papers (2026-02-02T09:14:51Z) - HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding [48.55833840968632]
Speculative decoding has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. We propose HIPPO, a holistic-aware parallel speculative decoding framework. Experiments on four video-LLMs across six benchmarks demonstrate HIPPO's effectiveness, yielding up to a 3.51x speedup.
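For context on the draft-and-verify idea in the summary above, here is a minimal, generic speculative-decoding step. It is an illustrative sketch only, not HIPPO's holistic-aware parallel variant; in practice the verifications are batched into a single target-model forward pass rather than called one token at a time.

```python
# Minimal sketch of the generic draft-and-verify speculative-decoding loop;
# illustrative only, not HIPPO's holistic-aware parallel variant.
from typing import Callable

def speculative_step(context: list,
                     draft: Callable[[list], list],
                     target_next: Callable[[list], str]) -> list:
    """Accept the longest draft prefix the target model agrees with,
    then append one token from the target so progress is guaranteed."""
    proposal = draft(context)
    accepted = []
    for tok in proposal:
        if target_next(context + accepted) == tok:
            accepted.append(tok)  # verified draft token: essentially free
        else:
            break
    accepted.append(target_next(context + accepted))  # target's own token
    return accepted

# Toy models: the draft guesses a stock phrase, the target follows a lookup.
sentence = ["the", "quick", "brown", "fox", "jumps"]
target = lambda ctx: sentence[len(ctx)] if len(ctx) < len(sentence) else "<eos>"
drafter = lambda ctx: ["quick", "brown", "dog"]
print(speculative_step(["the"], drafter, target))  # ['quick', 'brown', 'fox']
```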
arXiv Detail & Related papers (2026-01-13T07:02:43Z) - Beyond ROUGE: N-Gram Subspace Features for LLM Hallucination Detection [5.0106565473767075]
Large Language Models (LLMs) have demonstrated effectiveness across a wide variety of tasks involving natural language. A fundamental problem of hallucinations still plagues these models, limiting their trustworthiness in generating consistent, truthful information. We propose a novel approach inspired by ROUGE that constructs an N-Gram frequency tensor from LLM-generated text. This tensor captures richer semantic structure by encoding co-occurrence patterns, enabling better differentiation between factual and hallucinated content.
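The summary above describes building an N-gram frequency tensor from generated text. The snippet below sketches a much-simplified version using bigram co-occurrence counts; the paper's exact tensor construction and subspace analysis may differ.

```python
# Rough sketch of n-gram co-occurrence counting over generated text, in the
# spirit of the summary above; the paper's tensor construction may differ.
from collections import Counter
from itertools import islice

def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count n-grams over a whitespace tokenization (illustrative only)."""
    tokens = text.lower().split()
    grams = zip(*(islice(tokens, i, None) for i in range(n)))
    return Counter(grams)

responses = [
    "the eiffel tower is in paris",
    "the eiffel tower is in paris france",
    "the eiffel tower is in rome",  # a hallucinated variant
]
counts = [ngram_counts(r) for r in responses]
# Overlap of co-occurrence patterns between two responses, as a crude
# stand-in for comparing their n-gram structure.
overlap = sum((counts[0] & counts[2]).values()) / sum(counts[0].values())
print(overlap)
```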
arXiv Detail & Related papers (2025-09-03T18:52:24Z) - ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z) - ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation [74.37307916314407]
We propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely. Experiments on the state-of-the-art LRMs, including DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning.
arXiv Detail & Related papers (2025-06-23T16:20:44Z) - Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps [16.84310001807895]
This paper introduces a model-agnostic approach that can be applied to A-RAG methods. Specifically, we use cache access and parallel generation to speed up the prefilling and decoding stages respectively.
arXiv Detail & Related papers (2025-05-19T05:39:38Z) - Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation [9.137042895376343]
Large language models are susceptible to hallucinations, generating plausible yet factually incorrect content. Existing methods to mitigate such risk often rely on sampling multiple full-length generations. We introduce Monitoring Decoding, a novel framework that dynamically monitors the generation process.
arXiv Detail & Related papers (2025-03-05T01:51:03Z) - Improving Factuality with Explicit Working Memory [68.39261790277615]
Large language models can generate factually inaccurate content, a problem known as hallucination. We introduce EWE (Explicit Working Memory), a novel approach that enhances factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources.
arXiv Detail & Related papers (2024-12-24T00:55:59Z) - Path-Consistency: Prefix Enhancement for Efficient Inference in LLM [3.309813585671485]
*Path-consistency* mitigates both the errors and redundancies from random or less useful sampling in self-consistency. *Path-consistency* achieves significant acceleration in inference latency, ranging from 7.8% to 40.5%.
arXiv Detail & Related papers (2024-08-25T01:45:53Z) - Detecting, Explaining, and Mitigating Memorization in Diffusion Models [49.438362005962375]
We introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions.
Our proposed method seamlessly integrates without disrupting sampling algorithms, and delivers high accuracy even at the first generation step.
Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization.
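As a rough sketch of the detection signal described above, the snippet below scores a prompt by the magnitude of the gap between text-conditional and unconditional noise predictions. The arrays and scale factors are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of a memorization score based on the magnitude of the
# text-conditional prediction relative to the unconditional one.
# `eps_cond` / `eps_uncond` stand in for a real diffusion model's
# noise predictions; all values here are synthetic placeholders.
import numpy as np

def memorization_score(eps_cond: np.ndarray, eps_uncond: np.ndarray) -> float:
    """Larger conditional-vs-unconditional gaps suggest the prompt pulls the
    model strongly toward a specific (possibly memorized) output."""
    return float(np.linalg.norm(eps_cond - eps_uncond))

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))
eps_ordinary = eps_uncond + 0.1 * rng.standard_normal((4, 64, 64))   # ordinary prompt (toy)
eps_memorized = eps_uncond + 2.0 * rng.standard_normal((4, 64, 64))  # "memorized" prompt (toy)
print(memorization_score(eps_ordinary, eps_uncond),
      memorization_score(eps_memorized, eps_uncond))
```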
arXiv Detail & Related papers (2024-07-31T16:13:29Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely *hidden transfer*, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with LITE enables intermediate layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)