STORYSUMM: Evaluating Faithfulness in Story Summarization
- URL: http://arxiv.org/abs/2407.06501v2
- Date: Sat, 09 Nov 2024 00:42:46 GMT
- Title: STORYSUMM: Evaluating Faithfulness in Story Summarization
- Authors: Melanie Subbiah, Faisal Ladhak, Akankshya Mishra, Griffin Adams, Lydia B. Chilton, Kathleen McKeown
- Abstract summary: We introduce a new dataset, STORYSUMM, comprising short stories with localized faithfulness labels and error explanations.
This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies.
- Abstract: Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
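The balanced accuracy referenced above is the macro-average of recall over the faithful and unfaithful classes, so a metric cannot do well by always predicting the majority label. A minimal sketch of computing it for a hypothetical detector (the labels below are illustrative, not STORYSUMM data):

```python
# Minimal sketch: balanced accuracy over binary faithfulness labels,
# the score the abstract reports for automatic evaluators.
# The labels below are hypothetical, not taken from STORYSUMM.
from sklearn.metrics import balanced_accuracy_score

# 1 = faithful, 0 = unfaithful (gold labels vs. a metric's predictions)
gold = [1, 1, 0, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Balanced accuracy averages recall over the two classes.
print(balanced_accuracy_score(gold, pred))  # 0.75 for this toy example
```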
Related papers
- On Positional Bias of Faithfulness for Long-form Summarization [83.63283027830657]
Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs.
We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias.
arXiv Detail & Related papers (2024-10-31T03:50:15Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
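As a rough sketch of the general idea of injecting a controlled inconsistency into a parsed AMR graph (using the penman library; an assumption-laden illustration, not the AMRFact pipeline):

```python
# Sketch: create a negative example by swapping :ARG0 / :ARG1 roles
# in an AMR graph, i.e. flipping who does what to whom. This illustrates
# AMR-driven perturbation in general, not the AMRFact implementation.
import penman

amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
graph = penman.decode(amr)

def swap_core_roles(triples):
    swapped = []
    for src, role, tgt in triples:
        if role == ":ARG0":
            swapped.append((src, ":ARG1", tgt))
        elif role == ":ARG1":
            swapped.append((src, ":ARG0", tgt))
        else:
            swapped.append((src, role, tgt))
    return swapped

perturbed = penman.Graph(swap_core_roles(graph.triples))
print(penman.encode(perturbed))  # graph whose core roles no longer match the source
```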
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics [70.52570641514146]
We present a benchmark of unfaithful minimal pairs (BUMP).
BUMP is a dataset of 889 human-written, minimally different summary pairs.
Unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics.
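Because each pair differs only by one introduced error, a metric's consistency can be checked by asking whether the faithful member outscores its minimally edited counterpart. A minimal sketch with a placeholder scoring function:

```python
# Sketch: evaluating a faithfulness metric on minimal pairs.
# `score_summary` is a placeholder for any metric that returns a higher
# value for more faithful summaries; the pairs themselves are hypothetical.
from typing import Callable, List, Tuple

def pairwise_consistency(
    pairs: List[Tuple[str, str, str]],           # (source, faithful, unfaithful)
    score_summary: Callable[[str, str], float],  # metric: (source, summary) -> score
) -> float:
    """Fraction of pairs where the faithful summary outscores the unfaithful one."""
    wins = sum(
        score_summary(src, good) > score_summary(src, bad)
        for src, good, bad in pairs
    )
    return wins / len(pairs)
```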
arXiv Detail & Related papers (2022-12-20T02:17:30Z)
- Evaluating the Factual Consistency of Large Language Models Through News Summarization [97.04685401448499]
We propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization.
For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent.
For factually inconsistent summaries, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.
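One way to run such a comparison is to score each summary by its average token log-likelihood under a causal LM given the document and check which member of the pair scores higher. The sketch below uses GPT-2 from the transformers library; the model choice and prompt format are illustrative assumptions, not the paper's setup:

```python
# Sketch: score a summary by its mean token log-likelihood conditioned on
# the document, so consistent and inconsistent summaries can be compared.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def summary_loglik(document: str, summary: str) -> float:
    prompt = f"Document: {document}\nSummary:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + summary, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the summary tokens
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss  # mean NLL over summary tokens
    return -loss.item()

# A model "prefers" the consistent summary when it gets the higher score:
# summary_loglik(doc, consistent) > summary_loglik(doc, inconsistent)
```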
arXiv Detail & Related papers (2022-11-15T18:50:34Z)
- Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking [28.66287193703365]
We propose to generate factually inconsistent summaries using source texts and reference summaries with key information masked.
Experiments on seven benchmark datasets demonstrate that factual consistency classifiers trained on summaries generated using our method generally outperform existing models.
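The masking step can be pictured as hiding key information such as named entities before regeneration. A hedged sketch using spaCy NER (the mask token and the downstream generator are assumptions, not the paper's exact pipeline):

```python
# Sketch: mask key information (here, named entities) in a reference
# summary so a generator conditioned on the masked text is likely to
# produce factually inconsistent output.
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(summary: str, mask_token: str = "<mask>") -> str:
    doc = nlp(summary)
    masked = summary
    # Replace entities from right to left so character offsets stay valid.
    for ent in reversed(doc.ents):
        masked = masked[: ent.start_char] + mask_token + masked[ent.end_char :]
    return masked

print(mask_entities("Anna moved to Chicago in 2019 after leaving Acme Corp."))
# e.g. "<mask> moved to <mask> in <mask> after leaving <mask>."
```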
arXiv Detail & Related papers (2022-05-04T12:48:49Z)
- Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation [42.63902468258758]
We propose a novel metric to evaluate the factual consistency in text summarization via counterfactual estimation.
We conduct a series of experiments on three public abstractive text summarization datasets.
arXiv Detail & Related papers (2021-08-30T11:48:41Z)
- Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings [0.0]
We propose a new reference-free summary quality evaluation measure, with an emphasis on faithfulness.
The proposed ESTIME, Estimator of Summary-to-Text Inconsistency by Mismatched Embeddings, correlates more strongly with expert scores on the summary-level SummEval dataset than other common evaluation measures.
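A hedged sketch of the mismatched-embeddings intuition (not the exact ESTIME procedure): embed summary and source tokens with BERT, align each summary token to its most similar source token, and count surface mismatches.

```python
# Sketch: for each summary token, find its nearest source token by
# contextual-embedding similarity and count cases where the surface
# tokens disagree. An illustration, not the exact ESTIME algorithm.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_tokens(text: str):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, dim)
    tokens = tokenizer.convert_ids_to_tokens(enc.input_ids[0])
    return tokens, hidden

def mismatch_count(source: str, summary: str) -> int:
    src_toks, src_emb = embed_tokens(source)
    sum_toks, sum_emb = embed_tokens(summary)
    sims = torch.nn.functional.cosine_similarity(
        sum_emb.unsqueeze(1), src_emb.unsqueeze(0), dim=-1
    )                                                     # (sum_len, src_len)
    nearest = sims.argmax(dim=1).tolist()
    return sum(1 for i, j in enumerate(nearest) if sum_toks[i] != src_toks[j])
```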
arXiv Detail & Related papers (2021-04-12T01:58:21Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate the summary qualities without reference summaries by unsupervised contrastive learning.
Specifically, we design a new BERT-based metric that covers both linguistic quality and semantic informativeness.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
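One common way to train such a reference-free scorer contrastively is a margin ranking objective over clean versus corrupted summaries; the sketch below is an assumption about the general setup, not the paper's exact loss:

```python
# Sketch: a margin ranking objective for an unsupervised, reference-free
# quality scorer. The embeddings are assumed to come from a BERT-based
# encoder, and the corruption strategy is an assumption.
import torch
import torch.nn as nn

class QualityScorer(nn.Module):
    """Scores a (document, summary) pair from fixed-size embeddings."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, doc_emb: torch.Tensor, sum_emb: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([doc_emb, sum_emb], dim=-1)).squeeze(-1)

scorer = QualityScorer()
loss_fn = nn.MarginRankingLoss(margin=1.0)

def contrastive_step(doc_emb, clean_sum_emb, corrupted_sum_emb):
    good = scorer(doc_emb, clean_sum_emb)
    bad = scorer(doc_emb, corrupted_sum_emb)
    target = torch.ones_like(good)  # "clean should rank above corrupted"
    return loss_fn(good, bad, target)
```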
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
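The noising step can be pictured as a small token-level corruption of the sampled pseudo-summary; the specific operations and rates below are illustrative assumptions:

```python
# Sketch: create a noisy version of a sampled review that is treated as
# the summary, so a model can later be trained to denoise it.
import random

def add_noise(review: str, drop_prob: float = 0.15, shuffle_window: int = 3) -> str:
    tokens = review.split()
    # Randomly drop tokens.
    kept = [t for t in tokens if random.random() > drop_prob]
    # Lightly shuffle within a small window to perturb word order.
    noisy = kept[:]
    for i in range(len(noisy)):
        j = min(len(noisy) - 1, i + random.randint(0, shuffle_window - 1))
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return " ".join(noisy)

random.seed(0)
print(add_noise("The pasta was fresh and the service was friendly but slow."))
```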
arXiv Detail & Related papers (2020-04-21T16:54:57Z)