QAFactEval: Improved QA-Based Factual Consistency Evaluation for
Summarization
- URL: http://arxiv.org/abs/2112.08542v1
- Date: Thu, 16 Dec 2021 00:38:35 GMT
- Title: QAFactEval: Improved QA-Based Factual Consistency Evaluation for
Summarization
- Authors: Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong
- Abstract summary: We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
- Score: 116.56171113972944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Factual consistency is an essential quality of text summarization models in
practical settings. Existing work in evaluating this dimension can be broadly
categorized into two lines of research, entailment-based metrics and question
answering (QA)-based metrics. However, differing experimental setups presented
in recent work lead to contrasting conclusions as to which paradigm performs
best. In this work, we conduct an extensive comparison of entailment and
QA-based metrics, demonstrating that carefully choosing the components of a
QA-based metric is critical to performance. Building on those insights, we
propose an optimized metric, which we call QAFactEval, that leads to a 15%
average improvement over previous QA-based metrics on the SummaC factual
consistency benchmark. Our solution improves upon the best-performing
entailment-based metric and achieves state-of-the-art performance on this
benchmark. Furthermore, we find that QA-based and entailment-based metrics
offer complementary signals, and we combine the two into a single, learned metric
for a further performance boost. Through qualitative and quantitative analyses,
we point to question generation and answerability classification as two
critical components for future work in QA-based metrics.
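As a concrete illustration of the pipeline the abstract describes, the sketch below selects answer spans from the summary, generates a question for each, answers the question against the source document, and scores the agreement between the two answers. It is not the QAFactEval implementation: the model checkpoints, the highlight-based question-generation format, and the token-F1 comparison are assumptions chosen for illustration, and answer-span extraction is omitted.

```python
# Minimal sketch of a QA-based factual consistency score.
# NOT the QAFactEval implementation; model choices and scoring are illustrative.
from collections import Counter

from transformers import pipeline

# Assumed checkpoints: any answer-conditioned QG model and extractive QA model would do.
question_generator = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")
question_answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")


def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answer strings."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_consistency(source: str, summary: str, answer_spans: list[str]) -> float:
    """Average agreement between summary answers and source answers.

    `answer_spans` are answer candidates taken from the summary
    (e.g., named entities or noun phrases); their extraction is omitted here.
    """
    scores = []
    for answer in answer_spans:
        # 1. Generate a question whose answer in the summary is `answer`.
        #    The <hl> highlight format is specific to the assumed QG checkpoint.
        highlighted = summary.replace(answer, f"<hl> {answer} <hl>", 1)
        question = question_generator(f"generate question: {highlighted}")[0]["generated_text"]
        # 2. Answer the same question against the source document.
        source_answer = question_answerer(question=question, context=source)["answer"]
        # 3. Compare the two answers; answerability filtering and learned
        #    answer-overlap scoring are replaced by plain token F1 in this sketch.
        scores.append(token_f1(source_answer, answer))
    return sum(scores) / len(scores) if scores else 0.0
```

The two components the abstract singles out, question generation and answerability classification, are exactly where a naive sketch like this loses accuracy: badly formed questions and questions the source cannot answer both distort the final score.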
Related papers
- A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics [6.571049277167304]
We study the statistics of the existing evaluation metrics for a better understanding of their limitations.
As a potential solution, we discuss how a Mixture Of Grader could improve the quality of the automatic QA evaluator.
arXiv Detail & Related papers (2024-10-13T22:10:42Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore, a simple combination of an image-summary metric and a document-summary metric, to leverage their respective robustness and strong factuality detection performance.
We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics [74.28810048824519]
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not.
We benchmark the lexical answer verification methods which have been used by current QA-based metrics as well as two more sophisticated text comparison methods.
arXiv Detail & Related papers (2022-04-21T15:43:45Z)
- DirectQE: Direct Pretraining for Machine Translation Quality Estimation [41.187833219223336]
We argue that there are gaps between the predictor and the estimator in both data quality and training objectives.
We propose a novel framework called DirectQE that provides direct pretraining for QE tasks.
arXiv Detail & Related papers (2021-05-15T06:18:49Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated (a minimal meta-evaluation example is sketched after this list).
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
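Both the SummaC evaluation in the abstract and the GO FIGURE entry above are meta-evaluations: they measure how well a factuality metric's scores agree with human consistency judgments. The sketch below shows the core of that computation; the scores, labels, and threshold are hypothetical, with correlation and balanced accuracy used only as common reporting choices.

```python
# Minimal sketch of meta-evaluating a factuality metric: compare its scores
# against human consistency judgments. All data here is hypothetical.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-summary metric scores (e.g., from the QA sketch above)
# and binary human consistency labels for the same document-summary pairs.
metric_scores = [0.91, 0.12, 0.75, 0.33, 0.88, 0.05]
human_labels = [1, 0, 1, 0, 1, 0]

# Correlation-style meta-evaluation.
pearson, _ = pearsonr(metric_scores, human_labels)
spearman, _ = spearmanr(metric_scores, human_labels)
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")

# Benchmarks such as SummaC instead frame this as binary classification and
# report balanced accuracy after tuning a score threshold on validation data.
threshold = 0.5  # illustrative; in practice chosen per metric on a validation split
predictions = [int(s >= threshold) for s in metric_scores]
# Balanced accuracy: mean of per-class recall.
pos_recall = sum(p == 1 and y == 1 for p, y in zip(predictions, human_labels)) / human_labels.count(1)
neg_recall = sum(p == 0 and y == 0 for p, y in zip(predictions, human_labels)) / human_labels.count(0)
print(f"Balanced accuracy = {(pos_recall + neg_recall) / 2:.3f}")
```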
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.