SummEval: Re-evaluating Summarization Evaluation
- URL: http://arxiv.org/abs/2007.12626v4
- Date: Mon, 1 Feb 2021 19:56:15 GMT
- Title: SummEval: Re-evaluating Summarization Evaluation
- Authors: Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev
- Abstract summary: We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
- Score: 169.622515287256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scarcity of comprehensive up-to-date studies on evaluation metrics for
text summarization and the lack of consensus regarding evaluation protocols
continue to inhibit progress. We address the existing shortcomings of
summarization evaluation methods along five dimensions: 1) we re-evaluate 14
automatic evaluation metrics in a comprehensive and consistent fashion using
neural summarization model outputs along with expert and crowd-sourced human
annotations, 2) we consistently benchmark 23 recent summarization models using
the aforementioned automatic evaluation metrics, 3) we assemble the largest
collection of summaries generated by models trained on the CNN/DailyMail news
dataset and share it in a unified format, 4) we implement and share a toolkit
that provides an extensible and unified API for evaluating summarization models
across a broad range of automatic metrics, 5) we assemble and share the largest
and most diverse, in terms of model types, collection of human judgments of
model-generated summaries on the CNN/Daily Mail dataset annotated by both
expert judges and crowd-source workers. We hope that this work will help
promote a more complete evaluation protocol for text summarization as well as
advance research in developing evaluation metrics that better correlate with
human judgments.
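The toolkit from point 4 is not reproduced here, but the paper's core analysis is a correlation study: each automatic metric is scored on the same system outputs that experts annotated, and the two score lists are correlated at the summary level. Below is a minimal sketch of that procedure using the rouge_score and scipy packages rather than the paper's own API; all summaries, references, and ratings are invented.

```python
from rouge_score import rouge_scorer
from scipy.stats import kendalltau

# Hypothetical per-summary data: system outputs, references, and
# expert coherence ratings on a 1-5 scale (all values invented).
summaries = [
    "the court ruled against the appeal on friday .",
    "officials said the storm closed all schools .",
    "the team won its third title in five years .",
    "lawmakers delayed the vote until next month .",
]
references = [
    "on friday the court rejected the appeal .",
    "the storm forced officials to close every school .",
    "this is the team's third championship in five years .",
    "the vote was postponed by lawmakers until next month .",
]
expert_coherence = [4.7, 4.0, 3.3, 4.3]

# Score each summary with ROUGE-L F1 against its reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = [
    scorer.score(ref, summ)["rougeL"].fmeasure
    for ref, summ in zip(references, summaries)
]

# Summary-level correlation between the metric and the human ratings.
tau, p_value = kendalltau(rouge_l, expert_coherence)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```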
Related papers
- Assessment of Transformer-Based Encoder-Decoder Model for Human-Like Summarization [0.05852077003870416]
This work leverages the transformer-based BART model for human-like summarization.
After training and fine-tuning, the encoder-decoder model is tested on diverse sample articles.
The fine-tuned model's performance is compared with that of the baseline pretrained model.
Empirical results on BBC News articles highlight that the gold-standard summaries written by humans are 17% more factually consistent than the model-generated ones.
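For readers who want to try the general setup, generating a summary from a public pretrained BART checkpoint takes only a few lines with the Hugging Face transformers library. The sketch below uses the public facebook/bart-large-cnn model and an invented article; it is a stand-in, not the paper's fine-tuned model or data.

```python
from transformers import pipeline

# Public CNN/DailyMail-tuned BART checkpoint; the paper's own
# fine-tuned weights are not used here.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Heavy rainfall flooded several roads in the region overnight, and "
    "local authorities asked residents to avoid unnecessary travel while "
    "crews cleared debris from the motorways."
)
result = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```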
arXiv Detail & Related papers (2024-10-22T09:25:04Z)
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
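As a concrete illustration of a neural metric, the sketch below scores a candidate against a reference with BERTScore via the bert-score package; the example sentences are invented, and this is not the OpinSummEval evaluation code.

```python
from bert_score import score

# Hypothetical candidate summary and reference (invented examples).
candidates = ["the battery life is great but the screen is dim ."]
references = ["reviewers praise the battery but find the display too dark ."]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1[0].item():.3f}")
```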
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text with reference text along both objective and subjective dimensions.
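The paper's actual prompts are not given in the abstract; the sketch below only illustrates the general role-player idea of prompting an LLM to rate a summary on a named dimension. The template, role, and dimension names are entirely hypothetical, and no model API is called.

```python
# Hypothetical prompt template for an LLM-as-judge rating; the actual
# role-play prompts used in the paper may differ substantially.
TEMPLATE = """You are a {role} judging a news summary.
Rate the summary from 1 (poor) to 5 (excellent) on {dimension}.

Article:
{article}

Summary:
{summary}

Answer with a single integer."""

def build_judge_prompt(role: str, dimension: str,
                       article: str, summary: str) -> str:
    """Fill the role-player template for one (role, dimension) pairing."""
    return TEMPLATE.format(role=role, dimension=dimension,
                           article=article, summary=summary)

prompt = build_judge_prompt(
    role="careful copy editor",
    dimension="grammatical correctness",
    article="Storms flooded several roads in the region overnight.",
    summary="Overnight storms flooded roads in the region.",
)
print(prompt)
```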
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or are conducted at insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
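ACU annotation reduces summary scoring to binary checks: for each atomic content unit drawn from a reference, an annotator marks whether the candidate summary conveys it. A minimal sketch of the resulting coverage score follows; the helper and the judgments are hypothetical and do not reflect the exact RoSE scoring or data format.

```python
def acu_score(matched: list[bool]) -> float:
    """Fraction of reference ACUs judged present in the candidate summary."""
    if not matched:
        raise ValueError("need at least one ACU judgment")
    return sum(matched) / len(matched)

# Hypothetical annotator judgments for three ACUs of one reference.
judgments = [True, True, False]
print(f"{acu_score(judgments):.3f}")  # 0.667: two of three units covered
```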
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Prompted Opinion Summarization with GPT-3.5 [115.95460650578678]
We show that GPT-3.5 models achieve very strong performance in human evaluation.
We argue that standard evaluation metrics do not reflect this, and introduce three new metrics targeting faithfulness, factuality, and genericity.
arXiv Detail & Related papers (2022-11-29T04:06:21Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on a level playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
None of the currently available automatic coherence measures can assign reliable coherence scores to system summaries across all evaluation metrics; however, large-scale language models show promising results, provided that fine-tuning accounts for the need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
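Best-Worst Scaling derives a per-summary score from repeated best/worst picks over small tuples of items. A common counting rule, score = (#best - #worst) / #appearances, is sketched below with invented annotations; the paper's exact aggregation may differ.

```python
from collections import Counter

def bws_scores(tuples: list[tuple[list[str], str, str]]) -> dict[str, float]:
    """score(item) = (#times best - #times worst) / #times shown."""
    best, worst, shown = Counter(), Counter(), Counter()
    for items, picked_best, picked_worst in tuples:
        shown.update(items)
        best[picked_best] += 1
        worst[picked_worst] += 1
    return {i: (best[i] - worst[i]) / shown[i] for i in shown}

# Hypothetical annotations: (summaries shown, judged best, judged worst).
annotations = [
    (["A", "B", "C"], "A", "C"),
    (["A", "B", "D"], "B", "D"),
    (["B", "C", "D"], "B", "C"),
]
print(bws_scores(annotations))  # {'A': 0.5, 'B': 0.667, 'C': -1.0, 'D': -0.5}
```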
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- Learning by Semantic Similarity Makes Abstractive Summarization Better [13.324006587838522]
We compare summaries generated by a recent language model, BART, with the reference summaries from the CNN/DM benchmark dataset.
Interestingly, the model-generated summaries receive higher scores than the reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)