Factored Verification: Detecting and Reducing Hallucination in Summaries
of Academic Papers
- URL: http://arxiv.org/abs/2310.10627v1
- Date: Mon, 16 Oct 2023 17:51:17 GMT
- Title: Factored Verification: Detecting and Reducing Hallucination in Summaries
of Academic Papers
- Authors: Charlie George and Andreas Stuhlmüller
- Abstract summary: We use Factored Verification to detect hallucinations in abstractive summaries.
We estimate how often language models hallucinate when summarizing across multiple academic papers.
The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.
- Score: 1.7100359620532977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hallucination plagues even frontier LLMs--but how bad is it really for
summarizing academic papers? We evaluate Factored Verification, a simple
automated method for detecting hallucinations in abstractive summaries. This
method sets a new SotA on hallucination detection in the summarization task of
the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to
estimate how often language models hallucinate when summarizing across multiple
academic papers and find 0.62 hallucinations in the average ChatGPT (16k)
summary, 0.84 for GPT-4, and 1.55 for Claude 2. We ask models to self-correct
using Factored Critiques and find that this lowers the number of hallucinations
to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations
we find are often subtle, so we advise caution when using models to synthesize
academic papers.
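As the name suggests, Factored Verification works at the level of individual claims: a summary is factored into claims and each claim is checked against the source papers, with unsupported claims counted as hallucinations. The sketch below illustrates that claim-decomposition-and-verification pattern; it assumes an OpenAI-style chat API, and the prompts, model name, and helper functions are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of claim-level verification in the spirit of Factored
# Verification: split a summary into claims, then check each claim against
# the source papers. Prompts, the model name, and helper names are
# illustrative assumptions, not the authors' implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4"    # placeholder model name

def extract_claims(summary: str) -> list[str]:
    """Ask the model to decompose a summary into standalone factual claims."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "List each factual claim in the summary below on its "
                       "own line, with no extra text.\n\n" + summary,
        }],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

def claim_is_supported(claim: str, sources: str) -> bool:
    """Ask the model whether a single claim is supported by the source papers."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Source papers:\n{sources}\n\nClaim: {claim}\n\n"
                       "Is the claim fully supported by the sources? Answer Yes or No.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def count_hallucinations(summary: str, sources: str) -> int:
    """Count claims in the summary that the verifier judges unsupported."""
    return sum(not claim_is_supported(c, sources) for c in extract_claims(summary))
```

Self-correction in the spirit of Factored Critiques could then feed the unsupported claims back to the model with a request to revise the summary, which is the step the abstract reports lowers the hallucination counts.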
Related papers
- HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination".
This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z)
- Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling [67.14942827452161]
Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations.
In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification.
arXiv Detail & Related papers (2025-04-17T17:59:22Z)
- HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection [1.8230982862848586]
We aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English.
We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples.
Results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations.
arXiv Detail & Related papers (2025-03-25T13:40:22Z)
- Valuable Hallucinations: Realizable Non-realistic Propositions [2.451326684641447]
This paper introduces the first formal definition of valuable hallucinations in large language models (LLMs).
We focus on the potential value that certain types of hallucinations can offer in specific contexts.
We present experiments using the Qwen2.5 model and HalluQA dataset, employing ReAct prompting to control and optimize hallucinations.
arXiv Detail & Related papers (2025-02-16T12:59:11Z)
- From Single to Multi: How LLMs Hallucinate in Multi-Document Summarization [6.37435726278524]
We investigate how hallucinations manifest in large language models (LLMs) when summarizing topic-specific information from multiple documents.
On average, up to 75% of the content in LLM-generated summaries is hallucinated, with hallucinations more likely to occur towards the end of the summaries.
To understand the characteristics of these hallucinations, we manually evaluate 700+ insights and find that most errors stem from either failing to follow instructions or producing overly generic insights.
arXiv Detail & Related papers (2024-10-17T18:38:53Z)
- FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs [2.871226288151562]
This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs.
Our results show that GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations.
Even the best hallucination detection models achieve accuracies near 50% on FaithBench, indicating substantial room for future improvement.
arXiv Detail & Related papers (2024-10-17T04:30:46Z)
- FG-PRM: Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning [10.709365940160685]
Existing approaches primarily detect the presence of hallucinations but lack a nuanced understanding of their types and manifestations.
We introduce a comprehensive taxonomy that categorizes common hallucinations in mathematical reasoning tasks into six types.
We then propose FG-PRM, an augmented model designed to detect and mitigate hallucinations in a fine-grained, step-level manner.
arXiv Detail & Related papers (2024-10-08T19:25:26Z)
- ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models [65.12177400764506]
Large language models (LLMs) exhibit hallucinations in long-form question-answering tasks across various domains and wide applications.
Current hallucination detection and mitigation datasets are limited in domains and sizes.
This paper introduces an iterative self-training framework that simultaneously and progressively scales up the hallucination annotation dataset.
arXiv Detail & Related papers (2024-07-05T17:56:38Z)
- ANAH: Analytical Annotation of Hallucinations in Large Language Models [65.12177400764506]
We present ANAH, a dataset that offers ANalytical Annotation of Hallucinations in Large Language Models.
ANAH consists of 12k sentence-level annotations for 4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline.
Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that the hallucinations of LLMs accumulate in the answer and use ANAH to train and evaluate hallucination annotators.
arXiv Detail & Related papers (2024-05-30T17:54:40Z)
- A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models [11.218649399559691]
Fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2.
We find that common quantitative metrics do not correlate well with faithfulness and quality.
arXiv Detail & Related papers (2024-02-23T16:32:28Z)
- Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z)
- Alleviating Hallucinations of Large Language Models through Induced Hallucinations [67.35512483340837]
Large language models (LLMs) have been observed to generate responses that include inaccurate or fabricated information.
We propose a simple Induce-then-Contrast Decoding (ICD) strategy to alleviate hallucinations.
arXiv Detail & Related papers (2023-12-25T12:32:49Z)
- Evaluating Hallucinations in Chinese Large Language Models [65.4771562909392]
We establish a benchmark named HalluQA (Chinese Hallucination Question-Answering) to measure the hallucination phenomenon in Chinese large language models.
We consider two types of hallucinations: imitative falsehoods and factual errors, and we construct adversarial samples based on GLM-130B and ChatGPT.
For evaluation, we design an automated evaluation method using GPT-4 to judge whether a model output is hallucinated.
arXiv Detail & Related papers (2023-10-05T07:57:09Z)
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified against factual knowledge.
To understand what types of content LLMs are apt to hallucinate, and to what extent, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval).
arXiv Detail & Related papers (2023-05-19T15:36:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.