Metrics also Disagree in the Low Scoring Range: Revisiting Summarization
Evaluation Metrics
- URL: http://arxiv.org/abs/2011.04096v1
- Date: Sun, 8 Nov 2020 22:26:06 GMT
- Title: Metrics also Disagree in the Low Scoring Range: Revisiting Summarization
Evaluation Metrics
- Authors: Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu
- Abstract summary: One exemplar work concludes that automatic metrics strongly disagree when ranking high-scoring summaries.
We find that their observations stem from the fact that metrics disagree in ranking summaries from any narrow scoring range.
Apart from the width of the scoring range of summaries, we analyze three other properties that impact inter-metric agreement.
- Score: 20.105119107290488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In text summarization, evaluating the efficacy of automatic metrics without
human judgments has become recently popular. One exemplar work concludes that
automatic metrics strongly disagree when ranking high-scoring summaries. In
this paper, we revisit their experiments and find that their observations stem
from the fact that metrics disagree in ranking summaries from any narrow
scoring range. We hypothesize that this may be because summaries are similar to
each other in a narrow scoring range and are thus difficult to rank. Apart
from the width of the scoring range of summaries, we analyze three other
properties that impact inter-metric agreement: Ease of Summarization,
Abstractiveness, and Coverage. To encourage reproducible research, we make all
our analysis code and data publicly available.
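As a rough illustration of the central analysis (a minimal sketch under assumed inputs, not the authors' released code), the snippet below keeps only the summaries whose score under one metric falls in a narrow bin and then measures how well a second metric agrees on ranking them, using Kendall's tau. The metric names and score values are hypothetical placeholders.

```python
# Minimal sketch (assumed setup, not the paper's released code): restrict
# summaries to a narrow scoring bin under metric A and measure how well
# metric B agrees on ranking them, via Kendall's tau.
from scipy.stats import kendalltau

def agreement_in_range(scores_a, scores_b, low, high):
    """Kendall's tau between two metrics over summaries whose metric-A
    score lies in the narrow range [low, high)."""
    pairs = [(a, b) for a, b in zip(scores_a, scores_b) if low <= a < high]
    if len(pairs) < 2:
        return None  # too few summaries in the bin to compute a ranking
    kept_a, kept_b = zip(*pairs)
    tau, _ = kendalltau(kept_a, kept_b)
    return tau

# Hypothetical per-summary scores from two metrics (e.g., ROUGE and BERTScore).
rouge = [0.41, 0.43, 0.42, 0.44, 0.40, 0.75, 0.12]
bertscore = [0.88, 0.86, 0.90, 0.87, 0.89, 0.95, 0.60]

# Agreement over the full scoring range vs. within a narrow ROUGE bin.
print(agreement_in_range(rouge, bertscore, 0.0, 1.0))    # broad range: tau ~ 0.33
print(agreement_in_range(rouge, bertscore, 0.40, 0.45))  # narrow bin: tau ~ -0.4
```

Under the paper's hypothesis, the narrow-bin correlation tends to be much lower than the full-range one, since summaries inside a narrow bin are nearly indistinguishable; the toy numbers above are chosen only to make that pattern visible.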
Related papers
- Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation [21.650619533772232]
This work investigates whether and to what degree superficial attributes of summary texts suffice to predict "factuality".
We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements.
Motivated by these insights, we show that one can "game" (most) automatic factuality metrics, i.e., reliably inflate "factuality" scores by appending innocuous sentences to generated summaries.
arXiv Detail & Related papers (2024-11-25T18:15:15Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Incremental Extractive Opinion Summarization Using Cover Trees [81.59625423421355]
In online marketplaces, user reviews accumulate over time, and opinion summaries need to be updated periodically.
In this work, we study the task of extractive opinion summarization in an incremental setting.
We present an efficient algorithm for accurately computing the CentroidRank summaries in an incremental setting.
arXiv Detail & Related papers (2024-01-16T02:00:17Z)
- OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z)
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- Improving Factual Consistency of Abstractive Summarization via Question Answering [25.725873545789046]
We present an approach to address factual consistency in summarization.
We first propose an efficient automatic evaluation metric to measure factual consistency.
We then propose a novel learning algorithm that maximizes the proposed metric during model training.
arXiv Detail & Related papers (2021-05-10T19:07:21Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that does not require reference summaries.
Large amounts of data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.