Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization
- URL: http://arxiv.org/abs/2308.02270v1
- Date: Fri, 4 Aug 2023 11:47:19 GMT
- Title: Redundancy Aware Multi-Reference Based Gainwise Evaluation of Extractive Summarization
- Authors: Mousumi Akter, Shubhra Kanti Karmaker Santu
- Abstract summary: The ROUGE metric has long been criticized for its lack of semantic awareness and its ignorance of the ranking quality of the summarizer.
We propose a redundancy-aware Sem-nCG metric and demonstrate how this new metric can be used to evaluate model summaries against multiple references.
- Score: 1.022898441415693
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While very popular for evaluating extractive summarization tasks, the ROUGE
metric has long been criticized for its lack of semantic awareness and its
ignorance of the ranking quality of the summarizer. Previous research has
addressed these issues by proposing a gain-based automated metric called
Sem-nCG, which is both rank- and semantics-aware. However, Sem-nCG does not
consider the amount of redundancy present in a model-generated summary and
currently does not support evaluation with multiple reference summaries.
Unfortunately, addressing both of these limitations simultaneously is not trivial.
Therefore, in this paper, we propose a redundancy-aware Sem-nCG metric and
demonstrate how this new metric can be used to evaluate model summaries against
multiple references. We also explore different ways of incorporating redundancy
into the original metric through extensive experiments. Experimental results
demonstrate that the new redundancy-aware metric exhibits a higher correlation
with human judgments than the original Sem-nCG metric for both single and
multiple reference scenarios.
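The abstract describes the metric only at a high level. As a rough illustration of what a redundancy-aware, gain-based score can look like, here is a minimal Python sketch. The Jaccard token-overlap similarity (a stand-in for a semantic similarity model), the gain definition, the multiplicative penalty form, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch of a gain-based, redundancy-penalized summary score
# in the spirit of (Sem-)nCG. Similarity, gains, and the penalty below are
# simplifying assumptions, not the paper's exact method.

def similarity(a: str, b: str) -> float:
    """Stand-in for a semantic similarity model: token-overlap (Jaccard)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def gains(sentences, reference):
    """Gain of each source sentence = its similarity to the reference summary."""
    return [similarity(s, reference) for s in sentences]

def ncg_at_k(selected_idx, all_gains, k):
    """Cumulative gain of the top-k selected sentences, normalized by the
    ideal cumulative gain achievable with any k sentences."""
    cg = sum(all_gains[i] for i in selected_idx[:k])
    ideal = sum(sorted(all_gains, reverse=True)[:k])
    return cg / ideal if ideal > 0 else 0.0

def redundancy_penalty(sentences, selected_idx, k):
    """One possible penalty: mean pairwise similarity among selected sentences."""
    chosen = [sentences[i] for i in selected_idx[:k]]
    pairs = [(a, b) for i, a in enumerate(chosen) for b in chosen[i + 1:]]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

doc = ["the cat sat on the mat", "a dog barked loudly", "the cat was on the mat"]
ref = "the cat sat on the mat"
g = gains(doc, ref)
selected = [0, 2]  # indices chosen by a hypothetical extractive model
k = 2
score = ncg_at_k(selected, g, k) * (1 - redundancy_penalty(doc, selected, k))
print(f"redundancy-aware nCG@{k}: {score:.3f}")
```

One plausible extension to multiple references would be to aggregate each sentence's gain across references (e.g., by taking the maximum) before normalization; this is only a guess at how multi-reference support might work, not the paper's stated method.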
Related papers
- Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization [16.127739014966487]
We investigate the complementarity of different types of metrics in capturing the quality of a generated summary.
We present a new metric based on contrastive learning to capture this aspect of quality.
arXiv Detail & Related papers (2023-12-24T13:12:39Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations (see the sketch after this entry).
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
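As a concrete illustration of the multi-reference setup discussed in the entry above, matching-based metrics such as BLEU and chrF can be computed against several references at once. This minimal sketch uses the sacrebleu package (pip install sacrebleu); the example strings are made up.

```python
# Multi-reference scoring with matching-based metrics via sacrebleu.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
# One list per reference stream: references[r][i] is the r-th reference
# for the i-th hypothesis.
references = [
    ["the cat sat on the mat"],
    ["a cat was sitting on the mat"],
]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```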
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of references can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- Improving abstractive summarization with energy-based re-ranking [4.311978285976062]
We propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics.
We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries.
arXiv Detail & Related papers (2022-10-27T15:43:36Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation (a simplified sketch of the two scores follows this entry).
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
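The exact formulation is in the paper above; a much-simplified sketch of the two components might look like the following. The token-overlap similarity, the centrality weighting, and the max-over-peers redundancy are illustrative assumptions, not the paper's method.

```python
# Sketch: centrality-weighted relevance plus self-referenced redundancy.

def sim(a: str, b: str) -> float:
    """Stand-in similarity: token-overlap (Jaccard)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def centrality_weighted_relevance(source_sents, summary):
    # Centrality of each source sentence: mean similarity to the other sentences.
    cent = []
    for i, s in enumerate(source_sents):
        others = [t for j, t in enumerate(source_sents) if j != i]
        cent.append(sum(sim(s, t) for t in others) / len(others) if others else 0.0)
    total = sum(cent) or 1.0
    # Relevance: centrality-weighted similarity of the summary to each source sentence.
    return sum((c / total) * sim(s, summary) for c, s in zip(cent, source_sents))

def self_referenced_redundancy(summary_sents):
    # Redundancy: mean of each summary sentence's max similarity to its peers.
    if len(summary_sents) < 2:
        return 0.0
    return sum(max(sim(s, t) for j, t in enumerate(summary_sents) if j != i)
               for i, s in enumerate(summary_sents)) / len(summary_sents)

src = ["the cat sat on the mat", "the cat was on the mat", "a dog barked"]
summ = ["the cat sat on the mat", "the cat was on the mat"]
print(centrality_weighted_relevance(src, " ".join(summ)))
print(self_referenced_redundancy(summ))
```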
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that does not require reference summaries.
Massive amounts of data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries (see the sketch after this entry).
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
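A tiny sketch of the negative-sampling idea from the SueNes entry above: keep the document paired with its true summary as a positive example, and pair it with corrupted versions of that summary as negatives. The specific corruptions here (sentence deletion and shuffling) are assumptions for illustration, not SueNes's exact recipe.

```python
# Weak supervision via negative sampling: positives = (doc, true summary),
# negatives = (doc, corrupted summary). Corruptions below are illustrative.
import random

def corrupt(summary_sents, rng):
    sents = summary_sents[:]
    if len(sents) > 1 and rng.random() < 0.5:
        sents.pop(rng.randrange(len(sents)))  # delete a random sentence
    rng.shuffle(sents)                        # reorder what remains
    return sents

def make_pairs(document, summary_sents, n_negatives=2, seed=0):
    rng = random.Random(seed)
    pairs = [(document, " ".join(summary_sents), 1.0)]  # positive example
    for _ in range(n_negatives):
        pairs.append((document, " ".join(corrupt(summary_sents, rng)), 0.0))
    return pairs

for doc_text, summ, label in make_pairs("full article text ...",
                                        ["first sentence.", "second sentence."]):
    print(label, "|", summ)
```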
- SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization [31.082618343998533]
We propose SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary (see the sketch after this entry).
Compared to the state-of-the-art unsupervised evaluation metrics, SUPERT correlates better with human ratings by 18-39%.
We use SUPERT as rewards to guide a neural-based reinforcement learning summarizer, yielding favorable performance compared to the state-of-the-art unsupervised summarizers.
arXiv Detail & Related papers (2020-05-07T19:54:24Z)
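A rough sketch of the pseudo-reference idea from the SUPERT entry above: score a summary by its embedding similarity to a pseudo reference extracted from the source. Taking the leading sentences as the pseudo reference and the particular sentence-transformers model name are simplifying assumptions, not SUPERT's exact procedure.

```python
# Pseudo-reference scoring sketch: embed a pseudo reference (here, the
# leading source sentences -- an assumption) and the summary, then compare.
# Requires the sentence-transformers package (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary model choice

def supert_like_score(source_sents, summary, n_lead=3):
    pseudo_reference = " ".join(source_sents[:n_lead])
    emb = model.encode([pseudo_reference, summary])
    return float(util.cos_sim(emb[0], emb[1]))

docs = ["the court ruled on tuesday", "the decision was unanimous",
        "markets rose after the news"]
print(supert_like_score(docs, "the court issued a unanimous ruling"))
```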
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.