Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization
- URL: http://arxiv.org/abs/2308.02270v1
- Date: Fri, 4 Aug 2023 11:47:19 GMT
- Title: Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization
- Authors: Mousumi Akter, Shubhra Kanti Karmaker Santu
- Abstract summary: The ROUGE metric has long been criticized for its lack of semantic awareness and its ignorance of the ranking quality of the summarizer.
We propose a redundancy-aware Sem-nCG metric and demonstrate how this new metric can be used to evaluate model summaries against multiple references.
- Score: 1.022898441415693
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While very popular for evaluating extractive summarization tasks, the ROUGE
metric has long been criticized for its lack of semantic awareness and its
ignorance of the ranking quality of the summarizer. Previous research has
addressed these issues by proposing a gain-based automated metric called
Sem-nCG, which is both rank- and semantics-aware. However, Sem-nCG does not
consider the amount of redundancy present in a model-generated summary and
currently does not support evaluation with multiple reference summaries.
Unfortunately, addressing both of these limitations simultaneously is not trivial.
Therefore, in this paper, we propose a redundancy-aware Sem-nCG metric and
demonstrate how this new metric can be used to evaluate model summaries against
multiple references. We also explore different ways of incorporating redundancy
into the original metric through extensive experiments. Experimental results
demonstrate that the new redundancy-aware metric exhibits a higher correlation
with human judgments than the original Sem-nCG metric for both single and
multiple reference scenarios.
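The abstract describes the metric only at a high level. As a rough illustration of what a redundancy-aware, gain-based score can look like, here is a minimal Python sketch. The Jaccard token-overlap similarity (a stand-in for a semantic similarity model), the gain definition, the multiplicative penalty form, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch of a gain-based, redundancy-penalized summary score
# in the spirit of (Sem-)nCG. Similarity, gains, and the penalty below are
# simplifying assumptions, not the paper's exact method.

def similarity(a: str, b: str) -> float:
    """Stand-in for a semantic similarity model: token-overlap (Jaccard)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def gains(sentences, reference):
    """Gain of each source sentence = its similarity to the reference summary."""
    return [similarity(s, reference) for s in sentences]

def ncg_at_k(selected_idx, all_gains, k):
    """Cumulative gain of the top-k selected sentences, normalized by the
    ideal cumulative gain achievable with any k sentences."""
    cg = sum(all_gains[i] for i in selected_idx[:k])
    ideal = sum(sorted(all_gains, reverse=True)[:k])
    return cg / ideal if ideal > 0 else 0.0

def redundancy_penalty(sentences, selected_idx, k):
    """One possible penalty: mean pairwise similarity among selected sentences."""
    chosen = [sentences[i] for i in selected_idx[:k]]
    pairs = [(a, b) for i, a in enumerate(chosen) for b in chosen[i + 1:]]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

doc = ["the cat sat on the mat", "a dog barked loudly", "the cat was on the mat"]
ref = "the cat sat on the mat"
g = gains(doc, ref)
selected = [0, 2]  # indices chosen by a hypothetical extractive model
k = 2
score = ncg_at_k(selected, g, k) * (1 - redundancy_penalty(doc, selected, k))
print(f"redundancy-aware nCG@{k}: {score:.3f}")
```

One plausible extension to multiple references would be to aggregate each sentence's gain across references (e.g., by taking the maximum) before normalization; this is only a guess at how multi-reference support might work, not the paper's stated method.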
Related papers
- Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization [16.127739014966487]
We investigate the complementarity of different types of metrics in capturing the quality of a generated summary.
We present a new metric based on contrastive learning to capture this aspect of quality.
arXiv Detail & Related papers (2023-12-24T13:12:39Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations (see the sketch after this entry).
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
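As a concrete illustration of the multi-reference setup discussed in the entry above, matching-based metrics such as BLEU and chrF can be computed against several references at once. This minimal sketch uses the sacrebleu package (pip install sacrebleu); the example strings are made up.

```python
# Multi-reference scoring with matching-based metrics via sacrebleu.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
# One list per reference stream: references[r][i] is the r-th reference
# for the i-th hypothesis.
references = [
    ["the cat sat on the mat"],
    ["a cat was sitting on the mat"],
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```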
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of references can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- Improving abstractive summarization with energy-based re-ranking [4.311978285976062]
We propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics.
We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries.
arXiv Detail & Related papers (2022-10-27T15:43:36Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation (a simplified sketch of the two scores follows this entry).
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
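The exact formulation is in the paper above; a much-simplified sketch of the two components might look like the following. The token-overlap similarity, the centrality weighting, and the max-over-peers redundancy are illustrative assumptions, not the paper's method.

```python
# Sketch: centrality-weighted relevance plus self-referenced redundancy.

def sim(a: str, b: str) -> float:
    """Stand-in similarity: token-overlap (Jaccard)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def centrality_weighted_relevance(source_sents, summary):
    # Centrality of each source sentence: mean similarity to the other sentences.
    cent = []
    for i, s in enumerate(source_sents):
        others = [t for j, t in enumerate(source_sents) if j != i]
        cent.append(sum(sim(s, t) for t in others) / len(others) if others else 0.0)
    total = sum(cent) or 1.0
    # Relevance: centrality-weighted similarity of the summary to each source sentence.
    return sum((c / total) * sim(s, summary) for c, s in zip(cent, source_sents))

def self_referenced_redundancy(summary_sents):
    # Redundancy: mean of each summary sentence's max similarity to its peers.
    if len(summary_sents) < 2:
        return 0.0
    return sum(max(sim(s, t) for j, t in enumerate(summary_sents) if j != i)
               for i, s in enumerate(summary_sents)) / len(summary_sents)

src = ["the cat sat on the mat", "the cat was on the mat", "a dog barked"]
summ = ["the cat sat on the mat", "the cat was on the mat"]
print(centrality_weighted_relevance(src, " ".join(summ)))
print(self_referenced_redundancy(summ))
```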
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that does not require reference summaries.
Massive amounts of data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries (see the sketch after this entry).
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
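A tiny sketch of the negative-sampling idea from the SueNes entry above: keep the document paired with its true summary as a positive example, and pair it with corrupted versions of that summary as negatives. The specific corruptions here (sentence deletion and shuffling) are assumptions for illustration, not SueNes's exact recipe.

```python
# Weak supervision via negative sampling: positives = (doc, true summary),
# negatives = (doc, corrupted summary). Corruptions below are illustrative.
import random

def corrupt(summary_sents, rng):
    sents = summary_sents[:]
    if len(sents) > 1 and rng.random() < 0.5:
        sents.pop(rng.randrange(len(sents)))  # delete a random sentence
    rng.shuffle(sents)                        # reorder what remains
    return sents

def make_pairs(document, summary_sents, n_negatives=2, seed=0):
    rng = random.Random(seed)
    pairs = [(document, " ".join(summary_sents), 1.0)]  # positive example
    for _ in range(n_negatives):
        pairs.append((document, " ".join(corrupt(summary_sents, rng)), 0.0))
    return pairs

for doc_text, summ, label in make_pairs("full article text ...",
                                        ["first sentence.", "second sentence."]):
    print(label, "|", summ)
```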
- SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization [31.082618343998533]
We propose SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary (see the sketch after this entry).
Compared to the state-of-the-art unsupervised evaluation metrics, SUPERT correlates better with human ratings by 18-39%.
We use SUPERT as rewards to guide a neural-based reinforcement learning summarizer, yielding favorable performance compared to the state-of-the-art unsupervised summarizers.
arXiv Detail & Related papers (2020-05-07T19:54:24Z)
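A rough sketch of the pseudo-reference idea from the SUPERT entry above: score a summary by its embedding similarity to a pseudo reference extracted from the source. Taking the leading sentences as the pseudo reference and the particular sentence-transformers model name are simplifying assumptions, not SUPERT's exact procedure.

```python
# Pseudo-reference scoring sketch: embed a pseudo reference (here, the
# leading source sentences -- an assumption) and the summary, then compare.
# Requires the sentence-transformers package (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary model choice

def supert_like_score(source_sents, summary, n_lead=3):
    pseudo_reference = " ".join(source_sents[:n_lead])
    emb = model.encode([pseudo_reference, summary])
    return float(util.cos_sim(emb[0], emb[1]))

docs = ["the court ruled on tuesday", "the decision was unanimous",
        "markets rose after the news"]
print(supert_like_score(docs, "the court issued a unanimous ruling"))
```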
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.