Evaluating the Factual Consistency of Large Language Models Through News Summarization
- URL: http://arxiv.org/abs/2211.08412v2
- Date: Sat, 2 Dec 2023 18:10:15 GMT
- Title: Evaluating the Factual Consistency of Large Language Models Through News Summarization
- Authors: Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, Colin Raffel
- Abstract summary: We propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization.
For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent.
For factually inconsistent summaries, we use summaries generated by a suite of summarization models, which we have manually annotated as factually inconsistent.
- Score: 97.04685401448499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large language models (LLMs) have proven to be effective on a large
variety of tasks, they are also known to hallucinate information. To measure
whether an LLM prefers factually consistent continuations of its input, we
propose a new benchmark called FIB (Factual Inconsistency Benchmark) that
focuses on the task of summarization. Specifically, our benchmark involves
comparing the scores an LLM assigns to a factually consistent versus a
factually inconsistent summary for an input news article. For factually
consistent summaries, we use human-written reference summaries that we manually
verify as factually consistent. For factually inconsistent summaries, we use
summaries generated by a suite of summarization models, which we have manually
annotated as factually inconsistent. A model's factual
consistency is then measured according to its accuracy, i.e., the proportion of
documents where it assigns a higher score to the factually consistent summary.
To validate the usefulness of FIB, we evaluate 23 large language models ranging
from 1B to 176B parameters from six different model families including BLOOM
and OPT. We find that existing LLMs generally assign a higher score to
factually consistent summaries than to factually inconsistent summaries.
However, if the factually inconsistent summaries occur verbatim in the
document, then LLMs assign a higher score to these factually inconsistent
summaries than to factually consistent ones. We validate design choices in
our benchmark including the scoring method and source of distractor summaries.
Our code and benchmark data can be found at https://github.com/r-three/fib.
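To make the evaluation concrete, here is a minimal sketch of the benchmark's accuracy computation. The average log-likelihood scorer, the "gpt2" checkpoint, and the prompt format are illustrative assumptions; the paper itself compares several scoring functions.

```python
# Sketch of a FIB-style comparison: score both summaries under the LM and
# count the fraction of documents where the factually consistent one wins.
# "gpt2", the prompt format, and average log-likelihood scoring are all
# illustrative assumptions, not the benchmark's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def avg_loglik(document: str, summary: str) -> float:
    """Average per-token log-likelihood of `summary` conditioned on `document`."""
    prompt_ids = tokenizer(document + "\nSummary:", return_tensors="pt").input_ids
    target_ids = tokenizer(" " + summary, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i + 1, so shift by one and keep
    # only the positions that predict the summary tokens.
    n_prompt = prompt_ids.shape[1]
    log_probs = logits[:, n_prompt - 1 : -1].log_softmax(dim=-1)
    token_ll = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_ll.mean().item()

def fib_accuracy(examples) -> float:
    """examples: list of (document, consistent_summary, inconsistent_summary)."""
    wins = sum(avg_loglik(d, good) > avg_loglik(d, bad) for d, good, bad in examples)
    return wins / len(examples)
```

A length-normalized score is a natural default here because the paired summaries generally differ in length, and an unnormalized log-likelihood sum would systematically favor the shorter one.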
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
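A minimal sketch of a sentence-level comparison in this spirit, assuming the sentence-transformers library; the model name and the max-then-mean aggregation are illustrative choices, not necessarily the paper's exact formulation.

```python
# Sketch of a sentence-level SBERT comparison: each summary sentence is
# scored by its best cosine match among the source sentences, then the
# per-sentence scores are averaged. Model name and aggregation are assumed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def sbert_score(source_sentences: list[str], summary_sentences: list[str]) -> float:
    src = model.encode(source_sentences, convert_to_tensor=True)
    summ = model.encode(summary_sentences, convert_to_tensor=True)
    sims = util.cos_sim(summ, src)               # shape: [n_summary, n_source]
    return sims.max(dim=1).values.mean().item()  # best source match per summary sentence
```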
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
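A toy sketch of the parse-perturb-generate idea, assuming amrlib's sentence-to-graph and graph-to-sentence interfaces (pretrained AMR models must be installed); the string-level edit below is a deliberately crude stand-in for the paper's controlled, error-typed graph perturbations.

```python
# Toy sketch: parse a consistent summary to an AMR graph, perturb the
# graph, and generate a fluent but factually inconsistent summary.
# amrlib's default interface is assumed; the replace() call is a crude
# stand-in for AMRFact's controlled graph edits.
import amrlib

stog = amrlib.load_stog_model()   # sentence -> AMR graph (Penman string)
gtos = amrlib.load_gtos_model()   # AMR graph -> sentence

def perturbed_summary(summary: str, old: str, new: str) -> str:
    graph = stog.parse_sents([summary])[0]
    bad_graph = graph.replace(old, new)   # e.g. swap one named entity for another
    sentences, _ = gtos.generate([bad_graph])
    return sentences[0]
```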
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z)
- BooookScore: A systematic exploration of book-length summarization in the era of LLMs [53.42917858142565]
We develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types.
We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than those generated by open-source models.
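Once per-sentence error annotations exist, the metric itself reduces to a simple ratio; a sketch, assuming annotations arrive as lists of error-type labels per sentence.

```python
# BooookScore as stated: the fraction of summary sentences flagged with
# no error type at all.
def booookscore(sentence_errors: list[list[str]]) -> float:
    """sentence_errors[i] lists the error types found in summary sentence i."""
    clean = sum(1 for errors in sentence_errors if not errors)
    return clean / len(sentence_errors)

# Three of four sentences are error-free, so the score is 0.75.
print(booookscore([[], ["entity omission"], [], []]))
```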
arXiv Detail & Related papers (2023-10-01T20:46:44Z)
- Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method [35.181659789684545]
Automatic summarization generates concise summaries that contain key ideas of source documents.
Reference summaries from CNN/DailyMail and BBC XSum are noisy, mainly in terms of factual hallucination and information redundancy.
We propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step.
Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L.
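A rough sketch of the two-step prompting structure, where key elements are elicited first and the summary is conditioned on them; the prompt wording and the `llm` callable (prompt in, completion out) are assumptions, not the paper's exact prompts.

```python
# Two-step, element-then-summary prompting in the SumCoT spirit.
# Prompt wording and the `llm` callable are illustrative assumptions.
ELEMENT_PROMPT = (
    "Article: {article}\n"
    "List the core entities, dates, events, and final result of this article."
)
SUMMARY_PROMPT = (
    "Article: {article}\n"
    "Key elements: {elements}\n"
    "Using only these elements, write a concise one-sentence summary."
)

def sumcot(llm, article: str) -> str:
    elements = llm(ELEMENT_PROMPT.format(article=article))
    return llm(SUMMARY_PROMPT.format(article=article, elements=elements))
```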
arXiv Detail & Related papers (2023-05-22T18:54:35Z)
- BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics [70.52570641514146]
We present a benchmark of unfaithful minimal pairs (BUMP).
BUMP is a dataset of 889 human-written, minimally different summary pairs.
Unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics.
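The pair-based measurement BUMP enables can be sketched in a few lines; `metric` stands for any (document, summary) -> float faithfulness scorer, and a metric counts as consistent on a pair when it ranks the faithful summary higher.

```python
# Pair-based consistency check over minimally different summary pairs.
def pair_consistency(metric, pairs: list[tuple[str, str, str]]) -> float:
    """pairs: (document, faithful_summary, unfaithful_summary) triples."""
    wins = sum(metric(d, good) > metric(d, bad) for d, good, bad in pairs)
    return wins / len(pairs)
```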
arXiv Detail & Related papers (2022-12-20T02:17:30Z)
- Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling [56.70682379371534]
We show that our approach vastly outperforms prior methods in correcting erroneous summaries.
Our model -- FactEdit -- improves factuality scores by over 11 points on CNN/DM and over 31 points on XSum.
arXiv Detail & Related papers (2022-10-22T07:16:19Z)
- Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking [28.66287193703365]
We propose to generate factually inconsistent summaries using source texts and reference summaries with key information masked.
Experiments on seven benchmark datasets demonstrate that factual consistency classifiers trained on summaries generated using our method generally outperform existing models.
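A toy sketch of the masking step, using spaCy named entities as a stand-in for the key information being hidden; the model name and the entity-based choice of spans are illustrative assumptions. Running a summarizer over the masked inputs then yields candidate factually inconsistent summaries.

```python
# Mask "key information" (here: named entities) in a source or reference
# text so a summarizer regenerating from it tends to introduce factual
# inconsistencies. Model name and entity-based spans are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_key_info(text: str, mask: str = "[MASK]") -> str:
    doc = nlp(text)
    masked = text
    for ent in reversed(doc.ents):  # edit right-to-left so offsets stay valid
        masked = masked[: ent.start_char] + mask + masked[ent.end_char :]
    return masked
```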
arXiv Detail & Related papers (2022-05-04T12:48:49Z)