WIDAR -- Weighted Input Document Augmented ROUGE
- URL: http://arxiv.org/abs/2201.09282v1
- Date: Sun, 23 Jan 2022 14:40:42 GMT
- Title: WIDAR -- Weighted Input Document Augmented ROUGE
- Authors: Raghav Jain, Vaibhav Mavi, Anubhav Jangra, Sriparna Saha
- Abstract summary: The proposed metric WIDAR is designed to adapt the evaluation score according to the quality of the reference summary.
The proposed metric correlates better than ROUGE by 26% in coherence, 76% in consistency, 82% in fluency, and 15% in relevance on human judgement scores.
- Score: 26.123086537577155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of automatic text summarization has gained a lot of traction due to
recent advancements in machine learning techniques. However, evaluating the
quality of a generated summary remains an open problem. The literature has
widely adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the
standard evaluation metric for summarization, but ROUGE has some
long-established limitations, a major one being its dependence on the
availability of a good quality reference summary. In this work, we propose the
metric WIDAR, which uses the input document in addition to the reference
summary in order to evaluate the quality of the generated summary. The proposed
metric is versatile, since it is designed to adapt the evaluation score
according to the quality of the reference summary. On the human judgement
scores provided in the SummEval dataset, the proposed metric correlates better
than ROUGE by 26% in coherence, 76% in consistency, 82% in fluency, and 15% in
relevance. The proposed metric obtains results comparable to other
state-of-the-art metrics while requiring relatively short computational time.
Related papers
- Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics [4.881135687863645]
We introduce a reference-free metric that correlates well with human-evaluated relevance, while being very cheap to compute.
We show that this metric can also be used alongside reference-based metrics to improve their robustness in low-quality reference settings.
arXiv Detail & Related papers (2024-10-08T11:09:25Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization [16.127739014966487]
We investigate the complementarity of different types of metrics in capturing the quality of a generated summary.
We present a new metric based on contrastive learning to capture said aspect.
arXiv Detail & Related papers (2023-12-24T13:12:39Z)
- DocAsRef: An Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely [29.4981129248937]
We propose that some reference-based metrics can be effectively adapted to assess a system summary against its corresponding input document (a minimal sketch of this repurposing idea appears after this list).
After being repurposed reference-freely, the zero-shot BERTScore consistently outperforms its original reference-based version.
It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.
arXiv Detail & Related papers (2022-12-20T06:01:13Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries by unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
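As a rough illustration of the repurposing idea mentioned in the DocAsRef entry above, the sketch below scores a summary against its source document with the off-the-shelf `bert-score` package. Treating the document as the "reference" string is the only change made here; the exact zero-shot configuration used in that paper may differ, and the helper name is hypothetical.

```python
# Requires: pip install bert-score
from bert_score import score


def reference_free_bertscore(summary: str, source_document: str) -> float:
    """Score a summary against its source document instead of a human reference.

    Passing the document in the `refs` slot is one way to repurpose a
    reference-based metric reference-freely, in the spirit of DocAsRef;
    this is an assumption about the setup, not the paper's exact recipe.
    """
    _, _, f1 = score([summary], [source_document], lang="en", verbose=False)
    return f1.item()
```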