Related papers: FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs

FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs

URL: http://arxiv.org/abs/2410.02899v2
Date: Tue, 24 Jun 2025 19:49:48 GMT
Title: FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs
Authors: Deema Alnuhait, Neeraja Kirtane, Muhammad Khalifa, Hao Peng,
Abstract summary: We introduce FactCheckmate, which preemptively detects hallucinations by learning a classifier.<n>If a hallucination is detected, FactCheckmate then intervenes by adjusting the LM's hidden states.<n>Our results demonstrate the effectiveness of FactCheckmate, achieving over 70% preemptive detection accuracy.
Score: 21.767886997853022
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models (LMs) hallucinate. We inquire: Can we detect and mitigate hallucinations before they happen? This work answers this research question in the positive, by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckmate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the model's hidden states produced over the inputs, before decoding begins. If a hallucination is detected, FactCheckmate then intervenes by adjusting the LM's hidden states such that the model will produce more factual outputs. FactCheckmate provides fresh insights that the inner workings of LMs can be revealed by their hidden states. Practically, both its detection and mitigation models are lightweight, adding little inference overhead; FactCheckmate proves a more efficient approach for mitigating hallucinations compared to many post-hoc alternatives. We evaluate FactCheckmate over LMs of different scales and model families (including Llama, Mistral, Qwen and Gemma), across a variety of QA datasets from different domains. Our results demonstrate the effectiveness of FactCheckmate, achieving over 70% preemptive detection accuracy. On average, outputs generated by LMs with intervention are 34.4% more factual compared to those without.

Related papers

Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models [0.0]
We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in large language models.<n>Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations.
arXiv Detail & Related papers (2025-08-03T17:29:48Z)
FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs [8.820670807424174]
Large Language Models (LLMs) frequently generate hallucinated content. We propose FactSelfCheck, a novel black-box sampling-based method that enables fine-grained fact-level detection. Our approach represents text as knowledge graphs consisting of facts in the form of triples.
arXiv Detail & Related papers (2025-03-21T15:32:24Z)
Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs [45.13670875211498]
Large Language Models (LLMs) often generate outputs that lack grounding in real-world facts, a phenomenon known as hallucinations.<n>We show that models can hallucinate with high certainty even when they have the correct knowledge.
arXiv Detail & Related papers (2025-02-18T15:46:31Z)
The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States [0.5573267589690007]
We focus on hallucinations involving information not used in training, which we determine by using recency to ensure the information emerged after a cut-off date. This study investigates these hallucinations by detecting them at sentence level using different internal states of various language models. Our results show that IAVs detect hallucinations as effectively as CEVs and reveal that answerable and unanswerable prompts are encoded differently as separate classifiers.
arXiv Detail & Related papers (2024-12-22T15:08:24Z)
Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability [83.0884072598828]
Hallucinations come in many forms, and there is no universally accepted definition. We focus on studying only those hallucinations where a correct answer appears verbatim in the training set. We find that for a fixed dataset, larger and longer-trained LMs hallucinate less. While we see detector size improves performance on fixed LM's outputs, we find an inverse relationship between the scale of the LM and the detectability of its hallucinations.
arXiv Detail & Related papers (2024-08-14T23:34:28Z)
Data-augmented phrase-level alignment for mitigating object hallucination [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination. We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z)
On Large Language Models' Hallucination with Regard to Known Facts [74.96789694959894]
Large language models are successful in answering factoid questions but are also prone to hallucination. We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics. Our study shed light on understanding the reasons for LLMs' hallucinations on their known facts, and more importantly, on accurately predicting when they are hallucinating.
arXiv Detail & Related papers (2024-03-29T06:48:30Z)
Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations. We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms. We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z)
Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information. We propose a simple textitInduce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
arXiv Detail & Related papers (2023-12-25T12:32:49Z)
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data [102.56792377624927]
hallucinations inherent in machine-generated data remain under-explored. We present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm. Our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to LLaVA.
arXiv Detail & Related papers (2023-11-22T04:52:58Z)
A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks. We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion. We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.