Hallucination Detection and Evaluation of Large Language Model
- URL: http://arxiv.org/abs/2512.22416v1
- Date: Sat, 27 Dec 2025 00:17:03 GMT
- Title: Hallucination Detection and Evaluation of Large Language Model
- Authors: Chenggong Zhang, Haopeng Wang,
- Abstract summary: Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework.
- Score: 0.26856688022781555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
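The abstract's evaluation setup can be illustrated concretely. Below is a minimal sketch, with hypothetical data, of the three reported metrics (TPR, TNR, Accuracy) for a binary hallucination detector, plus the segment-based idea of scoring each piece of a summary separately against its source. The `score_fn` parameter is a stand-in assumption for an HHEM-style consistency scorer, not the actual HHEM API.

```python
# Metrics sketch for a binary hallucination detector, where label 1 means
# "hallucinated" (positive class) and 0 means "faithful".

def detection_metrics(y_true, y_pred):
    """Return (TPR, TNR, Accuracy) from parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity on hallucinated cases
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity on faithful cases
    acc = (tp + tn) / len(y_true)
    return tpr, tnr, acc


def segment_verdict(source, summary, score_fn, threshold=0.5):
    """Segment-based check: split the summary into sentences, score each
    against the source with a consistency scorer, and flag the whole summary
    as hallucinated (1) if any segment scores below the threshold.
    `score_fn(source, segment)` is a hypothetical scorer returning a value
    in [0, 1], higher meaning more consistent with the source."""
    segments = [s.strip() for s in summary.split(".") if s.strip()]
    scores = [score_fn(source, seg) for seg in segments]
    return 1 if any(s < threshold for s in scores) else 0
```

The segment-level loop is what lets a detector catch a single fabricated sentence inside an otherwise faithful summary, which is the localized-hallucination failure mode the abstract attributes to whole-document scoring.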
Related papers
- Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection ? [17.099852012707476]
We systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study. Experiments show that PEFT consistently strengthens hallucination detection ability. Further analyses indicate that PEFT methods primarily reshape how uncertainty is encoded and surfaced.
arXiv Detail & Related papers (2026-01-17T21:39:24Z) - Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models [49.435669307386156]
Multi-stage Prompt Refinement (MPR) is a framework designed to systematically improve ill-formed prompts across multiple stages. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Results on hallucination benchmarks show that MPR achieves over an 85% win rate compared to the original prompts.
arXiv Detail & Related papers (2025-10-14T00:31:36Z) - Bolster Hallucination Detection via Prompt-Guided Data Augmentation [33.98592618879001]
We introduce Prompt-guided data Augmented haLlucination dEtection (PALE) as data augmentation for hallucination detection. This framework can generate both truthful and hallucinated data under prompt guidance at a relatively low cost. In experiments, PALE achieves superior hallucination detection performance, outperforming the competitive baseline by a significant margin of 6.55%.
arXiv Detail & Related papers (2025-10-13T02:06:15Z) - The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs [10.103648327848763]
Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods.
arXiv Detail & Related papers (2025-08-01T20:34:01Z) - ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs [50.18087419133284]
Hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations. We introduce a novel metric, the ICR Score, which quantifies the contribution of modules to the hidden states' update. We propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states.
arXiv Detail & Related papers (2025-07-22T11:44:26Z) - Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs [5.161416961439468]
This study proposes an effective uncertainty estimation approach, Clustering-based semantic consistency (Cleanse). The effectiveness of Cleanse for detecting hallucination is validated using four off-the-shelf models: LLaMA-7B, LLaMA-13B, LLaMA2-7B, and Mistral-7B.
arXiv Detail & Related papers (2025-07-19T14:48:24Z) - REFIND at SemEval-2025 Task 3: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models [15.380441563675243]
REFIND (Retrieval-augmented Factuality hallucINation Detection) is a novel framework that detects hallucinated spans within large language model (LLM) outputs. We propose the Context Sensitivity Ratio (CSR), a novel metric that quantifies the sensitivity of LLM outputs to retrieved evidence. REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models.
arXiv Detail & Related papers (2025-02-19T10:59:05Z) - HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses [0.12499537119440242]
This paper proposes an explanation-enhanced hallucination detection model, coined HuDEx. The proposed model provides a novel approach that integrates detection with explanations, enabling both users and the LLM itself to understand and reduce errors.
arXiv Detail & Related papers (2025-02-12T04:17:02Z) - Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models [67.89204055004028]
Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination.
Previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics.
We propose a Hallucination benchmark Quality Measurement framework (HQM) to assess the reliability and validity of existing hallucination benchmarks.
arXiv Detail & Related papers (2024-06-24T20:08:07Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using: (i) hallucination rate, measuring model tendency to hallucinate, when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy to recognize relevant passages in the relevant subset.
arXiv Detail & Related papers (2023-12-18T17:18:04Z) - A New Benchmark and Reverse Validation Method for Passage-level
Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.