Re-evaluating Evaluation in Text Summarization
- URL: http://arxiv.org/abs/2010.07100v1
- Date: Wed, 14 Oct 2020 13:58:53 GMT
- Title: Re-evaluating Evaluation in Text Summarization
- Authors: Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu and Graham
Neubig
- Abstract summary: We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
- Score: 77.4601291738445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated evaluation metrics as a stand-in for manual evaluation are an
essential part of the development of text-generation tasks such as text
summarization. However, while the field has progressed, our standard metrics
have not -- for nearly 20 years ROUGE has been the standard evaluation in most
summarization papers. In this paper, we make an attempt to re-evaluate the
evaluation method for text summarization: assessing the reliability of
automatic metrics using top-scoring system outputs, both abstractive and
extractive, on recently popular datasets for both system-level and
summary-level evaluation settings. We find that conclusions about evaluation
metrics on older datasets do not necessarily hold on modern datasets and
systems.
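
The reliability assessment described in the abstract is usually quantified by correlating automatic metric scores with human judgments at two granularities. The sketch below is a minimal illustration of that standard protocol, not the paper's released code: it assumes a (systems x documents) matrix of metric scores and of human scores, and computes Kendall's tau at the system level (one point per system) and at the summary level (per document, then averaged). All names and the toy data are illustrative.

```python
# Minimal sketch of system-level vs. summary-level meta-evaluation of a metric.
# Assumes metric_scores and human_scores are (n_systems, n_docs) arrays of scores;
# names and data are illustrative, not taken from the paper's code.
import numpy as np
from scipy.stats import kendalltau

def system_level_corr(metric_scores: np.ndarray, human_scores: np.ndarray) -> float:
    # Correlate per-system averages: one point per system.
    tau, _ = kendalltau(metric_scores.mean(axis=1), human_scores.mean(axis=1))
    return tau

def summary_level_corr(metric_scores: np.ndarray, human_scores: np.ndarray) -> float:
    # Correlate across systems separately for each document, then average.
    taus = []
    for d in range(metric_scores.shape[1]):
        tau, _ = kendalltau(metric_scores[:, d], human_scores[:, d])
        taus.append(tau)
    return float(np.nanmean(taus))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    metric = rng.random((5, 100))                         # e.g. ROUGE for 5 systems on 100 docs
    human = metric + 0.1 * rng.standard_normal((5, 100))  # noisy stand-in for human judgments
    print(system_level_corr(metric, human), summary_level_corr(metric, human))
```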
Related papers
- A Critical Look at Meta-evaluating Summarisation Evaluation Metrics [11.541368732416506]
We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics.
We call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal.
arXiv Detail & Related papers (2024-09-29T01:30:13Z)
- LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation [28.438103177230477]
We evaluate the efficacy of automatic metrics for assessing the factual consistency of long document text summarisation.
We propose a new evaluation framework, LongDocFACTScore, which is suitable for evaluating long document summarisation data sets.
arXiv Detail & Related papers (2023-09-21T19:54:54Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose the devised instruction-style question about the quality of the generated text into subquestions that measure the quality of each sentence.
The subquestions and their PLM-generated answers are then recomposed as evidence to obtain the evaluation result (a much-simplified sketch of this formulation appears after this list).
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation [160.07938471250048]
Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.
We develop strong-performing automatic metrics for reference-based summarization evaluation.
arXiv Detail & Related papers (2023-03-07T02:49:50Z)
- RISE: Leveraging Retrieval Techniques for Summarization Evaluation [3.9215337270154995]
We present RISE, a new approach for evaluating summaries by leveraging techniques from information retrieval.
RISE is first trained as a retrieval task using a dual-encoder setup, and can subsequently be used to evaluate a generated summary given an input document, without gold reference summaries.
Comprehensive experiments on the SummEval benchmark (Fabbri et al., 2021) show that RISE correlates more strongly with human evaluations than many past approaches to summarization evaluation (a minimal sketch of this dual-encoder scoring appears after this list).
arXiv Detail & Related papers (2022-12-17T01:09:22Z)
- Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods [42.08097583183816]
We describe a new dataset, the podcast summary assessment corpus.
The dataset has two unique aspects: (i) long-input documents based on speech podcasts; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus.
arXiv Detail & Related papers (2022-08-28T18:24:41Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our metric significantly outperforms existing methods on both multi-document and single-document summarization evaluation (a simplified sketch of the two component scores appears after this list).
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
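
For the DecompEval entry above, the following is a much-simplified sketch of evaluating a summary as decomposed question answering: one yes/no subquestion per summary sentence, answered by an instruction-tuned PLM and averaged. The prompt wording, the choice of google/flan-t5-base, and the yes/no aggregation are assumptions for illustration, not the authors' implementation.

```python
# Much-simplified sketch of DecompEval-style decomposed QA evaluation.
# Prompt wording, model choice, and aggregation are assumptions for illustration.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def decomposed_quality_score(summary: str, document: str) -> float:
    # One subquestion per summary sentence (naive sentence split for brevity).
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    if not sentences:
        return 0.0
    yes_count = 0
    for sent in sentences:
        prompt = (
            f"Article: {document}\n"
            f"Is the following sentence a good summary sentence for the article? "
            f"Sentence: {sent}\nAnswer yes or no:"
        )
        answer = generator(prompt, max_new_tokens=3)[0]["generated_text"]
        yes_count += answer.strip().lower().startswith("yes")
    # Recompose the per-sentence answers into a single score.
    return yes_count / len(sentences)
```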
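
For the RISE entry, the sketch below shows the reference-free, dual-encoder scoring shape: encode the input document and the generated summary, and use their similarity as the score. An off-the-shelf sentence-transformers bi-encoder stands in for RISE's purpose-trained dual encoder, so this illustrates only the interface, not the released model.

```python
# Minimal sketch of reference-free, dual-encoder-style summary scoring.
# An off-the-shelf bi-encoder stands in for RISE's purpose-trained dual encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieval_style_score(document: str, summary: str) -> float:
    doc_emb = encoder.encode(document, convert_to_tensor=True)
    sum_emb = encoder.encode(summary, convert_to_tensor=True)
    # Higher document-summary similarity -> better summary; no gold reference needed.
    return util.cos_sim(doc_emb, sum_emb).item()
```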
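
For the training-free, reference-free metric entry, the sketch below gives one simplified reading of combining a centrality-weighted relevance score with a self-referenced redundancy score over precomputed sentence embeddings; the weighting and aggregation choices here are assumptions, not the paper's exact formulation.

```python
# Simplified sketch of a centrality-weighted relevance score combined with a
# self-referenced redundancy score over precomputed sentence embeddings.
# The exact weighting/aggregation is an assumption, not the paper's formula.
import numpy as np

def _cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def evaluate(doc_sent_embs: np.ndarray, sum_sent_embs: np.ndarray) -> float:
    # Centrality of each document sentence: its mean similarity to the document.
    doc_sim = _cosine_matrix(doc_sent_embs, doc_sent_embs)
    centrality = doc_sim.mean(axis=1)
    weights = centrality / centrality.sum()

    # Relevance: how well each document sentence is covered by the summary,
    # weighted by that sentence's centrality.
    rel = (_cosine_matrix(sum_sent_embs, doc_sent_embs).max(axis=0) * weights).sum()

    # Self-referenced redundancy: mean pairwise similarity among summary sentences.
    sum_sim = _cosine_matrix(sum_sent_embs, sum_sent_embs)
    n = len(sum_sent_embs)
    red = (sum_sim.sum() - n) / (n * (n - 1)) if n > 1 else 0.0

    # Reward relevance, penalize redundancy.
    return float(rel - red)
```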