FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs
- URL: http://arxiv.org/abs/2410.02899v2
- Date: Tue, 24 Jun 2025 19:49:48 GMT
- Title: FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs
- Authors: Deema Alnuhait, Neeraja Kirtane, Muhammad Khalifa, Hao Peng
- Abstract summary: We introduce FactCheckmate, which preemptively detects hallucinations by learning a classifier. If a hallucination is detected, FactCheckmate then intervenes by adjusting the LM's hidden states. Our results demonstrate the effectiveness of FactCheckmate, achieving over 70% preemptive detection accuracy.
- Score: 21.767886997853022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) hallucinate. We inquire: Can we detect and mitigate hallucinations before they happen? This work answers this research question in the positive, by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckmate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the model's hidden states produced over the inputs, before decoding begins. If a hallucination is detected, FactCheckmate then intervenes by adjusting the LM's hidden states such that the model will produce more factual outputs. FactCheckmate provides fresh insights that the inner workings of LMs can be revealed by their hidden states. Practically, both its detection and mitigation models are lightweight, adding little inference overhead; FactCheckmate proves a more efficient approach for mitigating hallucinations compared to many post-hoc alternatives. We evaluate FactCheckmate over LMs of different scales and model families (including Llama, Mistral, Qwen and Gemma), across a variety of QA datasets from different domains. Our results demonstrate the effectiveness of FactCheckmate, achieving over 70% preemptive detection accuracy. On average, outputs generated by LMs with intervention are 34.4% more factual compared to those without.
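The detect-then-intervene idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the hidden states here are synthetic Gaussian vectors rather than real LM activations, the detector is a plain logistic-regression probe, and the intervention is a simple shift against the probe's weight vector, whereas the paper describes a learned intervention model. All function names (`make_toy_hidden_states`, `train_probe`, `intervene`) and hyperparameters are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_hidden_states(n, dim, rng):
    """Synthetic stand-in for prompt hidden states (NOT real LM activations).

    Label 1 means "the LM would go on to hallucinate". The two classes
    differ only in the mean of the first two coordinates (an assumption).
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean = 1.5 if label == 1 else -1.5
        h = [rng.gauss(mean if i < 2 else 0.0, 1.0) for i in range(dim)]
        data.append((h, label))
    return data


def logit(w, b, h):
    """Linear score of the probe for hidden state h."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def predict_proba(w, b, h):
    """Probe's estimated probability that generation will hallucinate."""
    return sigmoid(logit(w, b, h))


def train_probe(data, dim, lr=0.05, epochs=30):
    """Fit a logistic-regression probe with plain SGD on the log loss."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for h, y in data:
            err = predict_proba(w, b, h) - y  # d(log loss) / d(logit)
            for i in range(dim):
                w[i] -= lr * err * h[i]
            b -= lr * err
    return w, b


def accuracy(w, b, data):
    """Fraction of examples where the thresholded probe matches the label."""
    correct = sum((predict_proba(w, b, h) > 0.5) == (y == 1) for h, y in data)
    return correct / len(data)


def intervene(w, b, h, margin=1.0):
    """Shift h against the probe direction until its logit equals -margin.

    A hand-written steering rule used as a placeholder for a learned
    intervention model; states already below -margin are returned unchanged.
    """
    z = logit(w, b, h)
    norm_sq = sum(wi * wi for wi in w)
    if z <= -margin or norm_sq == 0.0:
        return list(h)
    step = (z + margin) / norm_sq
    return [hi - step * wi for hi, wi in zip(h, w)]


if __name__ == "__main__":
    rng = random.Random(0)
    train = make_toy_hidden_states(400, 8, rng)
    held_out = make_toy_hidden_states(200, 8, rng)
    w, b = train_probe(train, 8)
    print("held-out probe accuracy:", accuracy(w, b, held_out))

    h = [1.5, 1.5] + [0.0] * 6  # a state the probe should flag
    print("P(hallucinate) before:", predict_proba(w, b, h))
    print("P(hallucinate) after: ", predict_proba(w, b, intervene(w, b, h)))
```

In a real setting the adjusted state would be fed back into the LM before decoding. The sketch only shows that the probe's own score drops after the shift, which says nothing about whether the resulting output is actually more factual.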
Related papers
- Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models [0.0]
We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in large language models. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations.
arXiv Detail & Related papers (2025-08-03T17:29:48Z) - FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs [8.820670807424174]
Large Language Models (LLMs) frequently generate hallucinated content.
We propose FactSelfCheck, a novel black-box sampling-based method that enables fine-grained fact-level detection.
Our approach represents text as knowledge graphs consisting of facts in the form of triples.
arXiv Detail & Related papers (2025-03-21T15:32:24Z) - Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs [45.13670875211498]
Large Language Models (LLMs) often generate outputs that lack grounding in real-world facts, a phenomenon known as hallucinations. We show that models can hallucinate with high certainty even when they have the correct knowledge.
arXiv Detail & Related papers (2025-02-18T15:46:31Z) - The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States [0.5573267589690007]
We focus on hallucinations involving information not used in training, which we determine by using recency to ensure the information emerged after a cut-off date.
This study investigates these hallucinations by detecting them at sentence level using different internal states of various language models.
Our results show that IAVs detect hallucinations as effectively as CEVs and reveal that answerable and unanswerable prompts are encoded differently as separate classifiers.
arXiv Detail & Related papers (2024-12-22T15:08:24Z) - Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability [83.0884072598828]
Hallucinations come in many forms, and there is no universally accepted definition.
We focus on studying only those hallucinations where a correct answer appears verbatim in the training set.
We find that for a fixed dataset, larger and longer-trained LMs hallucinate less.
While we see that detector size improves performance on a fixed LM's outputs, we find an inverse relationship between the scale of the LM and the detectability of its hallucinations.
arXiv Detail & Related papers (2024-08-14T23:34:28Z) - Data-augmented phrase-level alignment for mitigating object hallucination [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination.
We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z) - On Large Language Models' Hallucination with Regard to Known Facts [74.96789694959894]
Large language models are successful in answering factoid questions but are also prone to hallucination.
We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics.
Our study sheds light on the reasons for LLMs' hallucinations on their known facts and, more importantly, on accurately predicting when they are hallucinating.
arXiv Detail & Related papers (2024-03-29T06:48:30Z) - Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z) - Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information.
We propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
arXiv Detail & Related papers (2023-12-25T12:32:49Z) - HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data [102.56792377624927]
Hallucinations inherent in machine-generated data remain under-explored.
We present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm.
Our method achieves a 44.6% relative reduction in hallucinations while maintaining performance competitive with LLaVA.
arXiv Detail & Related papers (2023-11-22T04:52:58Z) - A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)