Towards Question-Answering as an Automatic Metric for Evaluating the
Content Quality of a Summary
- URL: http://arxiv.org/abs/2010.00490v3
- Date: Mon, 26 Jul 2021 18:47:26 GMT
- Title: Towards Question-Answering as an Automatic Metric for Evaluating the
Content Quality of a Summary
- Authors: Daniel Deutsch, Tania Bedrax-Weiss, Dan Roth
- Abstract summary: We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
- Score: 65.37544133256499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A desirable property of a reference-based evaluation metric that measures the
content quality of a summary is that it should estimate how much information
that summary has in common with a reference. Traditional text-overlap-based
metrics such as ROUGE fail to achieve this because they are limited to matching
tokens, either lexically or via embeddings. In this work, we propose a metric
to evaluate the content quality of a summary using question-answering (QA).
QA-based methods directly measure a summary's information overlap with a
reference, making them fundamentally different from text-overlap metrics. We
demonstrate the experimental benefits of QA-based metrics through an analysis
of our proposed metric, QAEval. QAEval outperforms current state-of-the-art
metrics on most evaluations using benchmark datasets, while being competitive
on others due to limitations of state-of-the-art models. Through a careful
analysis of each component of QAEval, we identify its performance bottlenecks
and estimate that its potential upper-bound performance surpasses all other
automatic metrics, approaching that of the gold-standard Pyramid Method.
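To make the QA-based idea above concrete, below is a minimal sketch of how a QA-based content score could be computed. It is illustrative only, not the QAEval implementation described in the paper: the names `qa_content_score`, `token_f1`, `reference_qas`, and `answer_fn` are assumptions introduced here, and the sketch presumes that question generation from the reference and question answering over the candidate summary are supplied externally (e.g., by pretrained models).

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a reference (gold) answer."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def qa_content_score(
    summary: str,
    reference_qas: List[Tuple[str, str]],   # (question, gold answer) pairs derived from the reference
    answer_fn: Callable[[str, str], str],   # QA model: (question, context) -> predicted answer
) -> float:
    """Average answer-overlap F1 over questions generated from the reference.

    A higher score suggests the candidate summary carries more of the
    reference's information, as probed by the questions.
    """
    if not reference_qas:
        return 0.0
    scores = [token_f1(answer_fn(q, summary), gold) for q, gold in reference_qas]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy usage with a trivial stand-in "QA model" that returns the whole summary.
    def naive_qa(question: str, context: str) -> str:
        return context

    qas = [("What does the metric evaluate?", "the content quality of a summary"),
           ("What technique does the metric use?", "question answering")]
    candidate = "The metric evaluates the content quality of a summary using question answering."
    print(round(qa_content_score(candidate, qas, naive_qa), 3))
```

In a real pipeline the answer comparison would typically be stricter or learned (exact match, token F1, or a trained verification model), which is exactly the component the answer-verification benchmarking paper listed below examines.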
Related papers
- Benchmarking Answer Verification Methods for Question Answering-Based
Summarization Evaluation Metrics [74.28810048824519]
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not.
We benchmark the lexical answer verification methods that have been used by current QA-based metrics, as well as two more sophisticated text comparison methods.
arXiv Detail & Related papers (2022-04-21T15:43:45Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- WIDAR -- Weighted Input Document Augmented ROUGE [26.123086537577155]
The proposed metric WIDAR is designed to adapt the evaluation score according to the quality of the reference summary.
The proposed metric correlates with human judgement scores better than ROUGE by 26%, 76%, 82%, and 15% in coherence, consistency, fluency, and relevance, respectively.
arXiv Detail & Related papers (2022-01-23T14:40:42Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for
Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Understanding the Extent to which Summarization Evaluation Metrics
Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.