Evaluating Factual Consistency of Summaries with Large Language Models
- URL: http://arxiv.org/abs/2305.14069v2
- Date: Thu, 12 Oct 2023 06:20:42 GMT
- Title: Evaluating Factual Consistency of Summaries with Large Language Models
- Authors: Shiqi Chen, Siyang Gao and Junxian He
- Abstract summary: We explore evaluating factual consistency of summaries by directly prompting large language models (LLMs).
Our experiments demonstrate that prompting LLMs is able to outperform the previous best factuality systems in all settings.
- Score: 24.416837319515896
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Detecting factual errors in summaries has been an important and challenging
subject in summarization research. Inspired by the emergent ability of large
language models (LLMs), we explore evaluating factual consistency of summaries
by directly prompting LLMs. We present a comprehensive empirical study to
assess the ability of LLMs as factual consistency evaluators, which consists of
(1) analyzing different LLMs such as the GPT model series and Flan-T5; (2)
investigating a variety of prompting methods including vanilla prompting,
chain-of-thought prompting, and a sentence-by-sentence prompting method to
tackle long summaries; and (3) evaluating on diverse summaries generated by
multiple summarization systems, ranging from pre-transformer methods to SOTA
pretrained models. Our experiments demonstrate that prompting LLMs is able to
outperform the previous best factuality systems in all settings, by up to 12.2
absolute points in terms of the binary classification accuracy on inconsistency
detection.
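The sentence-by-sentence prompting idea mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the prompt wording, the naive regex sentence splitter, the helper names, and the aggregation rule (a summary counts as consistent only if every sentence is judged consistent). The `ask_llm` callable is a hypothetical stand-in for whatever model client is used.

```python
import re
from typing import Callable, List

# Hypothetical prompt wording; the paper's actual prompts may differ.
PROMPT_TEMPLATE = (
    "Article: {document}\n"
    "Sentence: {sentence}\n"
    "Is the sentence factually consistent with the article? Answer yes or no."
)


def split_sentences(summary: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [p for p in parts if p]


def build_prompts(document: str, summary: str) -> List[str]:
    """Build one prompt per summary sentence."""
    return [
        PROMPT_TEMPLATE.format(document=document, sentence=s)
        for s in split_sentences(summary)
    ]


def is_consistent(document: str, summary: str,
                  ask_llm: Callable[[str], str]) -> bool:
    """Assumed aggregation: consistent only if every sentence gets a 'yes'."""
    answers = [ask_llm(p) for p in build_prompts(document, summary)]
    return all(a.strip().lower().startswith("yes") for a in answers)


def accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Binary classification accuracy over paired predictions and labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Passing the model as a plain callable keeps the sketch independent of any particular API; a real evaluation would replace it with an actual LLM call and a more robust sentence splitter.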
Related papers
- Learning to Refine with Fine-Grained Natural Language Feedback [81.70313509881315]
We propose looking at refinement with feedback as a composition of three distinct LLM competencies.
A key property of this approach is that the step 2 critique model can give fine-grained feedback about errors.
We show that models of different capabilities benefit from refining with this approach on the task of improving factual consistency of document grounded summaries.
arXiv Detail & Related papers (2024-07-02T16:15:01Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We show that prompting-based rationales align better with human-annotated rationales than attribution-based rationales.
We additionally find that the faithfulness limitations of prompting-based methods, which are identified in previous work, may be linked to their collapsed predictions.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Factual Dialogue Summarization via Learning from Large Language Models [35.63037083806503]
Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries.
We employ zero-shot learning to extract symbolic knowledge from LLMs, generating factually consistent (positive) and inconsistent (negative) summaries.
Our approach achieves better factual consistency while maintaining coherence, fluency, and relevance, as confirmed by various automatic evaluation metrics.
arXiv Detail & Related papers (2024-06-20T20:03:37Z)
- SIFiD: Reassess Summary Factual Inconsistency Detection with LLM [27.392514180175283]
This study reassesses summary inconsistency detection with Large Language Models (LLMs).
We propose SIFiD (Summary Inconsistency Detection with Filtered Document), which identifies key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents.
arXiv Detail & Related papers (2024-03-12T11:41:51Z)
- TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization [29.49641083851667]
We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes.
We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences.
arXiv Detail & Related papers (2024-02-20T18:58:49Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs).
Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z)
- Semantic Consistency for Assuring Reliability of Large Language Models [9.876355290198639]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks.
We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs.
We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z)
- On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)