Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries
- URL: http://arxiv.org/abs/2010.12495v1
- Date: Fri, 23 Oct 2020 15:55:15 GMT
- Title: Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries
- Authors: Daniel Deutsch, Dan Roth
- Abstract summary: We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
- Score: 74.28810048824519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reference-based metrics such as ROUGE or BERTScore evaluate the content
quality of a summary by comparing the summary to a reference. Ideally, this
comparison should measure the summary's information quality by calculating how
much information the summaries have in common. In this work, we analyze the
token alignments used by ROUGE and BERTScore to compare summaries and argue
that their scores largely cannot be interpreted as measuring information
overlap, but rather the extent to which they discuss the same topics. Further,
we provide evidence that this result holds true for many other summarization
evaluation metrics. The consequence is that the summarization community has
not yet found a reliable automatic metric that aligns with its research goal
of generating summaries with high-quality information. Then, we propose a
simple and interpretable method of evaluating
summaries which does directly measure information overlap and demonstrate how
it can be used to gain insights into model behavior that could not be provided
by other methods alone.
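To make concrete what this reference-based comparison looks like in practice, here is a minimal sketch that scores an invented candidate summary against an invented reference with ROUGE and BERTScore. It assumes the `rouge-score` and `bert-score` Python packages and illustrates the token-level matching the paper analyzes; it is not code from the paper.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The company reported record profits driven by strong cloud sales."
candidate = "Record profits were announced, largely due to growth in cloud revenue."

# ROUGE aligns tokens by exact (optionally stemmed) n-gram matches.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)  # signature: score(target, prediction)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# BERTScore aligns tokens greedily by contextual-embedding similarity.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(F1.item(), 3))
```

Because ROUGE matches surface tokens and BERTScore matches contextually similar tokens, both can reward summaries that merely discuss the same topics, which is the behavior the paper argues should not be read as measuring information overlap.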
Related papers
- Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods.
We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment.
We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
- SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals to encourage the model to generate content in the reference summaries that has not been covered.
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2023-01-25T09:33:11Z)
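As a rough illustration of the NLI-based coverage idea in the SWING entry above, the sketch below asks an off-the-shelf NLI model whether each reference sentence is entailed by a generated summary; sentences with low entailment probability are treated as not yet covered. The model choice (roberta-large-mnli) and the sentence-level framing are assumptions for illustration, not SWING's actual training signal.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the MNLI model."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli(**inputs).logits, dim=-1)[0]
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

generated = "Alice will send the report tonight."
reference_sentences = [
    "Alice agreed to send the report by tonight.",
    "Bob asked for the slides as well.",
]
# Reference content with a low entailment probability counts as "not covered".
for sentence in reference_sentences:
    print(round(entailment_prob(generated, sentence), 2), sentence)
```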
- Comparing Methods for Extractive Summarization of Call Centre Dialogue [77.34726150561087]
We experimentally compare several such methods by using them to produce summaries of calls, and evaluating these summaries objectively.
We found that TopicSum and Lead-N outperform the other summarisation methods, whilst BERTSum received comparatively lower scores in both subjective and objective evaluations.
arXiv Detail & Related papers (2022-09-06T13:16:02Z)
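For context on the baselines compared above, here is a tiny sketch of a Lead-N extractive summary, which simply keeps the first N utterances of the call; the transcript format is invented for illustration.

```python
def lead_n_summary(utterances: list[str], n: int = 3) -> str:
    """Lead-N baseline: keep the first n utterances as the summary."""
    return " ".join(utterances[:n])

call = [
    "Agent: Thanks for calling, how can I help?",
    "Customer: My router keeps dropping the connection.",
    "Agent: Let's restart it and check the firmware version.",
    "Customer: Okay, it's back online now.",
]
print(lead_n_summary(call, n=2))
```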
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
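The sketch below loosely re-implements the two components named in the entry above under stated assumptions: sentence embeddings from sentence-transformers, centrality computed as mean pairwise similarity among source sentences, and cosine similarity throughout. The paper's exact formulation may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def evaluate(source_sents: list[str], summary_sents: list[str]) -> dict:
    src = model.encode(source_sents)
    summ = model.encode(summary_sents)
    # Centrality weight of a source sentence: its mean similarity to the rest
    # of the document, normalized to sum to one.
    centrality = cosine_matrix(src, src).mean(axis=1)
    weights = centrality / centrality.sum()
    # Relevance: how well the summary covers each source sentence, weighted
    # by that sentence's centrality.
    relevance = float((cosine_matrix(src, summ).max(axis=1) * weights).sum())
    # Self-referenced redundancy: average similarity between distinct summary
    # sentences (higher means more repetition).
    sim = cosine_matrix(summ, summ)
    n = len(summary_sents)
    redundancy = float((sim.sum() - n) / (n * (n - 1))) if n > 1 else 0.0
    return {"relevance": relevance, "redundancy": redundancy}
```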
- Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings [0.0]
We propose a new reference-free summary quality evaluation measure, with an emphasis on faithfulness.
The proposed ESTIME, Estimator of Summary-to-Text Inconsistency by Mismatched Embeddings, correlates more strongly with expert scores on the summary-level SummEval dataset than other common evaluation measures.
arXiv Detail & Related papers (2021-04-12T01:58:21Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
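The following sketch illustrates the general contrastive idea described above rather than the paper's model: a BERT-based scorer is trained so that a document's summary outranks a corrupted (sentence-shuffled) version of itself by a margin. The encoder, corruption strategy, and loss are illustrative choices.

```python
import random
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
scorer = nn.Linear(encoder.config.hidden_size, 1)  # maps [CLS] embedding to a score
loss_fn = nn.MarginRankingLoss(margin=0.1)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(scorer.parameters()), lr=2e-5)

def score(document: str, summary: str) -> torch.Tensor:
    inputs = tok(document, summary, return_tensors="pt", truncation=True, max_length=512)
    cls = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation of the pair
    return scorer(cls).squeeze(-1)

def corrupt(summary: str) -> str:
    # Crude negative example: shuffle the summary's sentences to damage coherence.
    sentences = summary.split(". ")
    random.shuffle(sentences)
    return ". ".join(sentences)

def training_step(document: str, summary: str) -> float:
    pos = score(document, summary)
    neg = score(document, corrupt(summary))
    loss = loss_fn(pos, neg, torch.ones_like(pos))  # want pos > neg by the margin
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```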
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
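As a simplified illustration of QA-based content evaluation, the sketch below answers questions derived from the reference using the candidate summary as context and scores the predicted answers with token-level F1. The question-answer pairs are hand-written here; QAEval derives them automatically from the reference.

```python
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token overlap F1 between a predicted and a gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

candidate_summary = "The council approved the new bridge and funding of 2 million dollars."
qa_pairs = [  # hand-written stand-ins for QA pairs generated from the reference
    ("What did the council approve?", "the new bridge"),
    ("How much funding was allocated?", "2 million dollars"),
]

scores = []
for question, gold in qa_pairs:
    prediction = qa(question=question, context=candidate_summary)["answer"]
    scores.append(token_f1(prediction, gold))
print("QA-based content score:", round(sum(scores) / len(scores), 3))
```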