Chain-of-Verification Reduces Hallucination in Large Language Models
- URL: http://arxiv.org/abs/2309.11495v2
- Date: Mon, 25 Sep 2023 15:25:49 GMT
- Title: Chain-of-Verification Reduces Hallucination in Large Language Models
- Authors: Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian
Li, Asli Celikyilmaz, Jason Weston
- Abstract summary: We study the ability of language models to deliberate on the responses they give in order to correct their mistakes.
We develop the Chain-of-Verification (CoVe) method whereby the model first drafts an initial response.
We show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata to closed book MultiSpanQA.
- Score: 80.99318041981776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generation of plausible yet incorrect factual information, termed
hallucination, is an unsolved issue in large language models. We study the
ability of language models to deliberate on the responses they give in order to
correct their mistakes. We develop the Chain-of-Verification (CoVe) method
whereby the model first (i) drafts an initial response; then (ii) plans
verification questions to fact-check its draft; (iii) answers those questions
independently so the answers are not biased by other responses; and (iv)
generates its final verified response. In experiments, we show CoVe decreases
hallucinations across a variety of tasks, from list-based questions from
Wikidata to closed-book MultiSpanQA and longform text generation.
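The four CoVe steps can be pictured as a simple prompting pipeline. The sketch below is illustrative only: the `generate` helper stands in for a single LLM call, and the prompt templates and line-per-question format are assumptions for clarity, not the authors' released prompts or code.

```python
# Illustrative Chain-of-Verification (CoVe) pipeline.
# `generate` is a placeholder for one LLM call (e.g. an API client); it is an
# assumption for this sketch, not part of the paper's released code.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def chain_of_verification(query: str) -> str:
    # (i) Draft an initial (baseline) response.
    draft = generate(f"Answer the question.\nQuestion: {query}\nAnswer:")

    # (ii) Plan verification questions that fact-check the draft.
    plan = generate(
        "List short fact-checking questions, one per line, that verify the "
        f"claims in this answer.\nQuestion: {query}\nDraft answer: {draft}\n"
        "Verification questions:"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # (iii) Answer each verification question independently: the draft is NOT
    # shown, so the answers are not biased by it.
    verifications = [(q, generate(f"Answer concisely.\nQuestion: {q}\nAnswer:"))
                     for q in questions]

    # (iv) Generate the final verified response, conditioned on the draft and
    # the verification question/answer pairs.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    return generate(
        f"Original question: {query}\nDraft answer: {draft}\n"
        f"Verification results:\n{evidence}\n"
        "Write a final answer that keeps only facts consistent with the "
        "verification results.\nFinal answer:"
    )
```

Answering the verification questions in isolation is the key design choice: it prevents the model from simply repeating errors present in its own draft.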
Related papers
- Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers [0.0]
We propose, implement, and apply the Language Model Chain (LMC) algorithm. In this algorithm, a language model's response to a given prompt is considered correct only if it exists in the collection of possible answers. We used the LMC algorithm to extract patient dates of birth from medical documents.
arXiv Detail & Related papers (2025-07-21T14:31:16Z) - Hallucination Detection with Small Language Models [1.9181612035055007]
This paper proposes a framework that integrates multiple small language models to verify responses generated by large language models. The results demonstrate a 10% improvement in F1 score for distinguishing correct responses from hallucinations.
arXiv Detail & Related papers (2025-06-24T02:19:26Z) - keepitsimple at SemEval-2025 Task 3: LLM-Uncertainty based Approach for Multilingual Hallucination Span Detection [0.0]
Identification of hallucination spans in text generated by black-box language models is essential for real-world applications. We present our solution to this problem, which capitalizes on the variability of stochastically sampled responses in order to identify hallucinated spans (see the sketch after this list). We measure this divergence through entropy-based analysis, allowing for accurate identification of hallucinated segments.
arXiv Detail & Related papers (2025-05-23T05:25:14Z) - A Unified Hallucination Mitigation Framework for Large Vision-Language Models [18.595958586621943]
We present a unified framework, Dentist, for hallucination mitigation.
The core step is to first classify the queries, then perform different processes of hallucination mitigation based on the classification result.
On MMbench, we achieve a 13.44%/10.2%/15.8% improvement in accuracy on Image Quality.
arXiv Detail & Related papers (2024-09-24T22:36:58Z) - Localizing and Mitigating Errors in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z) - Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
arXiv Detail & Related papers (2024-07-11T16:51:33Z) - On Large Language Models' Hallucination with Regard to Known Facts [74.96789694959894]
Large language models are successful in answering factoid questions but are also prone to hallucination.
We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics.
Our study sheds light on the reasons for LLMs' hallucinations about facts they know and, more importantly, on accurately predicting when they are hallucinating.
arXiv Detail & Related papers (2024-03-29T06:48:30Z) - Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations [70.6395572287422]
The self-alignment method is capable of not only refusing to answer but also providing an explanation for why unknown questions are unanswerable.
We conduct disparity-driven self-curation to select qualified data for fine-tuning the LLM itself, aligning its responses to unknown questions as desired.
arXiv Detail & Related papers (2024-02-23T02:24:36Z) - Ever: Mitigating Hallucination in Large Language Models through
Real-Time Verification and Rectification [18.59695929601458]
We introduce a novel approach called Real-time Verification and Rectification (Ever).
Ever employs a real-time, step-wise generation and hallucination rectification strategy.
Ever demonstrates a significant improvement in generating trustworthy and factually accurate text across a diverse range of tasks.
arXiv Detail & Related papers (2023-11-15T17:04:56Z) - Weakly Supervised Visual Question Answer Generation [2.7605547688813172]
We present a weakly supervised method that synthetically generates question-answer pairs procedurally from visual information and captions.
We perform an exhaustive experimental analysis on the VQA dataset and see that our model significantly outperforms SOTA methods on BLEU scores.
arXiv Detail & Related papers (2023-06-11T08:46:42Z) - CLAM: Selective Clarification for Ambiguous Questions with Large
Language Models [37.37606905433334]
We show that current SotA models do not ask the user for clarification when presented with imprecise questions.
We introduce CLAM, a framework that first uses the model to detect ambiguous questions and if an ambiguous question is detected, prompts the model to ask the user for clarification.
We show that our method achieves a 20.15 percentage point accuracy improvement over SotA on a novel ambiguous question-answering dataset.
arXiv Detail & Related papers (2022-12-15T12:47:18Z) - Read before Generate! Faithful Long Form Question Answering with Machine
Reading [77.17898499652306]
Long-form question answering (LFQA) aims to generate a paragraph-length answer for a given question.
We propose a new end-to-end framework that jointly models answer generation and machine reading.
arXiv Detail & Related papers (2022-03-01T10:41:17Z)