Benchmarking Answer Verification Methods for Question Answering-Based
Summarization Evaluation Metrics
- URL: http://arxiv.org/abs/2204.10206v1
- Date: Thu, 21 Apr 2022 15:43:45 GMT
- Title: Benchmarking Answer Verification Methods for Question Answering-Based
Summarization Evaluation Metrics
- Authors: Daniel Deutsch and Dan Roth
- Abstract summary: Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not.
We benchmark the lexical answer verification methods which have been used by current QA-based metrics as well as two more sophisticated text comparison methods.
- Score: 74.28810048824519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question answering-based summarization evaluation metrics must automatically
determine whether the QA model's prediction is correct or not, a task known as
answer verification. In this work, we benchmark the lexical answer verification
methods which have been used by current QA-based metrics as well as two more
sophisticated text comparison methods, BERTScore and LERC. We find that LERC
outperforms the other methods in some settings while remaining statistically
indistinguishable from lexical overlap in others. However, our experiments
reveal that improved verification performance does not necessarily translate to
overall QA-based metric quality: In some scenarios, using a worse verification
method -- or using none at all -- has comparable performance to using the best
verification method, a result that we attribute to properties of the datasets.
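For concreteness, the lexical verification methods used by current QA-based metrics are typically SQuAD-style exact match and token F1 over normalized strings. The sketch below is illustrative rather than the paper's exact implementation; BERTScore or LERC would replace the string comparison with a learned similarity score.

```python
# Minimal sketch of lexical answer verification (SQuAD-style normalization,
# exact match, and token F1). Function names are illustrative.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: token_f1("the Eiffel Tower in Paris", "Eiffel Tower") ~= 0.67,
# while exact_match returns 0.0 for the same pair.
```

The example in the final comment shows the failure mode that motivates learned verifiers: a prediction that contains the correct answer can still receive zero credit under exact match.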
Related papers
- How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs? [3.1706553206969925]
We perform a meta-evaluation of such methods and assess their reliability across a broad range of tasks.
We observe that while automatic evaluation methods can approximate human ratings under specific conditions, their validity is highly context-dependent.
Our findings enhance the understanding of how automatic methods should be applied and interpreted when developing and evaluating instruction-tuned LLMs.
arXiv Detail & Related papers (2024-02-16T15:48:33Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation)
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
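As a toy illustration of the resampling analysis described in the entry above, a percentile-bootstrap confidence interval for a metric's correlation with human judgments can be computed as follows. The use of Kendall's tau and the variable names are assumptions for the sketch, not the paper's exact protocol.

```python
# Percentile bootstrap CI for a metric's correlation with human judgments.
import numpy as np
from scipy.stats import kendalltau

def bootstrap_ci(metric_scores, human_scores, n_boot=1000, alpha=0.05, seed=0):
    """Resample paired scores with replacement and return a (lo, hi) CI."""
    rng = np.random.default_rng(seed)
    metric_scores = np.asarray(metric_scores)
    human_scores = np.asarray(human_scores)
    n = len(metric_scores)
    taus = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample summaries with replacement
        tau, _ = kendalltau(metric_scores[idx], human_scores[idx])
        taus.append(tau)
    lo, hi = np.percentile(taus, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```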
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA)
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
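Tying the thread together, a QA-based summarization metric such as QAEval can be viewed as the pipeline below, with answer verification as the final, swappable step. The callables are hypothetical stand-ins for the actual question-generation and QA models; this is a sketch of the overall structure, not the paper's implementation.

```python
# High-level sketch of a QA-based summary metric: generate QA pairs from the
# reference, answer the questions against the candidate summary, verify the
# predicted answers, and average the verification scores.
from typing import Callable, List, Tuple

def qa_based_metric(
    reference: str,
    candidate: str,
    generate_qa_pairs: Callable[[str], List[Tuple[str, str]]],  # reference -> (question, gold answer) pairs
    answer_question: Callable[[str, str], str],                 # (question, candidate) -> predicted answer
    verify: Callable[[str, str], float],                        # (prediction, gold) -> score in [0, 1]
) -> float:
    """Score a candidate summary by how well it answers questions drawn from the reference."""
    qa_pairs = generate_qa_pairs(reference)
    if not qa_pairs:
        return 0.0
    scores = [verify(answer_question(q, candidate), gold) for q, gold in qa_pairs]
    return sum(scores) / len(scores)

# With verify=exact_match or token_f1 from the earlier sketch, this mirrors the
# lexical setting benchmarked in the paper; swapping in BERTScore or LERC
# changes only the verify step, which is exactly the comparison the paper studies.
```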