How Far are We from Robust Long Abstractive Summarization?
- URL: http://arxiv.org/abs/2210.16732v1
- Date: Sun, 30 Oct 2022 03:19:50 GMT
- Title: How Far are We from Robust Long Abstractive Summarization?
- Authors: Huan Yee Koh, Jiaxin Ju, He Zhang, Ming Liu, Shirui Pan
- Abstract summary: We evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of deploying them to generate reliable summaries.
For long document evaluation metrics, human evaluation results show that ROUGE remains the best at evaluating the relevancy of a summary.
We release our annotated long document dataset with the hope that it can contribute to the development of metrics across a broader range of summarization settings.
- Score: 39.34743996451813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Abstractive summarization has made tremendous progress in recent years. In
this work, we perform fine-grained human annotations to evaluate long document
abstractive summarization systems (i.e., models and metrics) with the aim of
deploying them to generate reliable summaries. For long document abstractive
models, we show that the constant pursuit of state-of-the-art ROUGE results can
lead to more relevant summaries but not more factual ones. For long
document evaluation metrics, human evaluation results show that ROUGE remains
the best at evaluating the relevancy of a summary. It also reveals important
limitations of factuality metrics in detecting different types of factual
errors and the reasons behind the effectiveness of BARTScore. We then suggest
promising directions in the endeavor of developing factual consistency metrics.
Finally, we release our annotated long document dataset with the hope that it
can contribute to the development of metrics across a broader range of
summarization settings.
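To make the metric discussion above concrete, here is a minimal sketch of the two metric families the paper examines: ROUGE for relevancy against a reference, and a BARTScore-style average log-likelihood of the summary given the source for factual consistency. It assumes the rouge-score, torch, and transformers packages and approximates, rather than reproduces, the paper's evaluation setup.
```python
# Hedged sketch, not the paper's evaluation code.
import torch
from rouge_score import rouge_scorer
from transformers import BartForConditionalGeneration, BartTokenizer

def rouge_relevancy(reference: str, summary: str) -> dict:
    """ROUGE-1/2/L F1 of a summary against a reference (relevancy proxy)."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    return {k: v.fmeasure for k, v in scorer.score(reference, summary).items()}

def bart_score(source: str, summary: str, model, tokenizer) -> float:
    """BARTScore-style faithfulness: mean log-probability of the summary
    tokens conditioned on the source (higher, i.e. less negative, is better)."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(summary, return_tensors="pt", truncation=True)["input_ids"]
    with torch.no_grad():
        loss = model(**src, labels=tgt).loss  # mean token cross-entropy
    return -loss.item()

# Checkpoint choice follows common BARTScore practice; any seq2seq model works.
name = "facebook/bart-large-cnn"
tok = BartTokenizer.from_pretrained(name)
bart = BartForConditionalGeneration.from_pretrained(name).eval()
```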
Related papers
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
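The NLI-plus-claims recipe that FENICE describes above can be sketched roughly as follows. This is not the authors' implementation: the MNLI checkpoint, the max-over-source-sentences aggregation, and the assumption that claims have already been extracted from the summary (e.g., by an LLM) are all simplifications.
```python
# Rough sketch of NLI-based claim verification; not FENICE itself.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumption: any MNLI-style model works here
tok = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()
ENTAILMENT = nli.config.label2id["ENTAILMENT"]

def claim_score(source_sents: list[str], claim: str) -> float:
    """Max P(entailment) of one summary claim over all source sentences."""
    enc = tok(source_sents, [claim] * len(source_sents),
              return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        probs = nli(**enc).logits.softmax(-1)[:, ENTAILMENT]
    return probs.max().item()

def factuality(source_sents: list[str], claims: list[str]) -> float:
    """Average claim-level entailment; claims are assumed pre-extracted."""
    return sum(claim_score(source_sents, c) for c in claims) / len(claims)
```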
- LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation [28.438103177230477]
We evaluate the efficacy of automatic metrics for assessing the factual consistency of long document text summarisation.
We propose a new evaluation framework, LongDocFACTScore, which is suitable for evaluating long document summarisation data sets.
arXiv Detail & Related papers (2023-09-21T19:54:54Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore, a combination of two metrics that leverages their robustness and strong factuality detection performance on image-summary and document-summary pairs, respectively.
We show that this simple combination, in the zero-shot setting, achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
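A hedged sketch of the two-component combination CLIPBERTScore describes above, assuming the bert-score package and a Hugging Face CLIP checkpoint; the equal weighting is an illustrative default, not the paper's setting.
```python
# Illustrative combination of CLIP (image-summary) and BERTScore
# (document-summary) signals; not the authors' code.
import torch
from bert_score import score as bertscore  # pip install bert-score
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_sim(image, summary: str) -> float:
    """Cosine similarity between CLIP image and summary embeddings."""
    inputs = proc(text=[summary], images=image,
                  return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def clipbertscore(image, document: str, summary: str, w: float = 0.5) -> float:
    """Weighted combination; w = 0.5 is an assumption, not the paper's."""
    _, _, f1 = bertscore([summary], [document], lang="en")
    return w * clip_sim(image, summary) + (1 - w) * float(f1[0])
```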
- SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
- Efficient Attentions for Long Document Summarization [25.234852272297598]
Hepos is a novel, efficient encoder-decoder attention mechanism with head-wise positional strides.
We are able to process ten times more tokens than existing models that use full attention.
arXiv Detail & Related papers (2021-04-05T18:45:13Z)
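The head-wise positional strides idea above can be illustrated with a toy cross-attention in which head h attends only to source positions congruent to h modulo the stride, so the heads jointly cover the full source at a fraction of the memory cost. A conceptual sketch, not the authors' implementation:
```python
# Toy Hepos-style cross-attention with head-wise positional strides.
import torch

def hepos_cross_attention(q, k, v, stride: int):
    """q: (heads, tgt_len, d); k, v: (heads, src_len, d).
    Head h attends only to source positions p with p % stride == h % stride."""
    heads, _, d = q.shape
    src_len = k.shape[1]
    outs = []
    for h in range(heads):
        idx = torch.arange(h % stride, src_len, stride)  # this head's slice
        att = torch.softmax(q[h] @ k[h, idx].T / d ** 0.5, dim=-1)
        outs.append(att @ v[h, idx])
    return torch.stack(outs)  # (heads, tgt_len, d)
```
Per head, attention memory drops from O(tgt_len x src_len) to roughly O(tgt_len x src_len / stride), which is what makes much longer inputs feasible.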
- On Generating Extended Summaries of Long Documents [16.149617108647707]
We present a new method for generating extended summaries of long papers.
Our method exploits the hierarchical structure of the documents and incorporates it into an extractive summarization model.
Our analysis shows that our multi-tasking approach can adjust the extraction probability distribution in favor of summary-worthy sentences.
arXiv Detail & Related papers (2020-12-28T08:10:28Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
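The meta-evaluation step implied above, correlating each automatic metric with human judgments, can be sketched as follows; SciPy, Kendall's tau, and summary-level aggregation are illustrative assumptions rather than SummEval's exact toolkit.
```python
# Illustrative metric meta-evaluation; not the SummEval toolkit.
from scipy.stats import kendalltau

def metric_human_correlation(metric_scores: list[float],
                             human_scores: list[float]) -> float:
    """Summary-level agreement between one metric and human ratings."""
    tau, _ = kendalltau(metric_scores, human_scores)
    return tau

# e.g. metric_human_correlation(rouge1_f1_per_summary, human_relevance_ratings)
```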
- Massive Multi-Document Summarization of Product Reviews with Weak Supervision [11.462916848094403]
Product review summarization is a type of Multi-Document Summarization (MDS) task.
We show that summarizing small samples of the reviews can result in loss of important information.
We propose a schema for summarizing a massive set of reviews on top of a standard summarization algorithm.
arXiv Detail & Related papers (2020-07-22T11:22:57Z)
- Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z)
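The noising step described above, which samples a review, treats it as a pseudo-summary, and derives noisy pseudo-inputs from it, might look like the following token-level simplification; the paper's noising is more structured, and every knob here is an assumption.
```python
# Simplified synthetic-pair construction for noise-and-denoise training.
import random

def noisy_version(summary_tokens, corpus_tokens, p_drop=0.1, p_repl=0.1):
    """Corrupt a pseudo-summary by random deletions and replacements."""
    out = []
    for tok in summary_tokens:
        r = random.random()
        if r < p_drop:
            continue                                   # deletion noise
        if r < p_drop + p_repl:
            out.append(random.choice(corpus_tokens))   # replacement noise
        else:
            out.append(tok)
    return out

def make_pair(reviews, n_inputs=8):
    """One synthetic (noisy inputs -> pseudo-summary) training pair."""
    summary = random.choice(reviews).split()
    corpus = [t for r in reviews for t in r.split()]
    inputs = [noisy_version(summary, corpus) for _ in range(n_inputs)]
    return inputs, summary
```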
This list is automatically generated from the titles and abstracts of the papers on this site.