Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question
Answering Evaluation
- URL: http://arxiv.org/abs/2202.07654v1
- Date: Tue, 15 Feb 2022 18:53:58 GMT
- Title: Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question
Answering Evaluation
- Authors: Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin
Boerschinger, Tal Schuster
- Abstract summary: Question answering systems are typically evaluated against manually annotated finite sets of one or more answers.
This leads to a coverage limitation that results in underestimating the true performance of systems.
We present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures.
- Score: 11.733609600774306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The predictions of question answering (QA) systems are typically evaluated
against manually annotated finite sets of one or more answers. This leads to a
coverage limitation that results in underestimating the true performance of
systems, and is typically addressed by extending over exact match (EM) with
predefined rules or with the token-level F1 measure. In this paper, we present
the first systematic conceptual and data-driven analysis to examine the
shortcomings of token-level equivalence measures.
To this end, we define the asymmetric notion of answer equivalence (AE),
accepting answers that are equivalent to or improve over the reference, and
collect over 26K human judgements for candidates produced by multiple QA
systems on SQuAD. Through a careful analysis of this data, we reveal and
quantify several concrete limitations of the F1 measure, such as a false
impression of graduality, a missing dependence on the question, and more.
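For concreteness, the token-level measures discussed above fit in a few lines. The sketch below follows the standard SQuAD-style recipe for exact match (EM) and token-level F1; the example answers at the end are hypothetical and only illustrate the limitations listed above. Note that the question never enters either function, which is exactly the missing dependence on the question.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 only if the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over the bag-of-token overlap."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical examples of the "false graduality" problem: the two candidates below
# receive roughly the same partial credit, although one is wrong and one is fully acceptable.
print(token_f1("in the 19th century", "in the 20th century"))  # ~0.67, but the answer is wrong
print(token_f1("Einstein", "Albert Einstein"))                 # ~0.67, but the answer is fine
```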
Since collecting AE annotations for each evaluated model is expensive, we
learn a BERT matching (BEM) measure to approximate this task. Because matching
is a simpler task than QA, we find that BEM provides significantly better AE
approximations than F1 and more accurately reflects the performance of systems.
Finally, we also demonstrate the practical utility of AE and BEM on the
concrete application of minimal accurate prediction sets, reducing the number
of required answers by up to 2.6 times.
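The BEM measure itself is a learned matcher rather than a string heuristic. The released model is not reproduced here; the following is only a minimal sketch, assuming a generic BERT-style classifier over (question, reference, candidate) triples that would be fine-tuned on the collected AE judgements. The checkpoint name, input formatting, and label convention are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch of a BEM-style answer-equivalence classifier (not the authors' released model).
# Assumes a BERT encoder with a binary classification head, to be fine-tuned on AE labels.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption: placeholder encoder, not the paper's checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def bem_score(question: str, reference: str, candidate: str) -> float:
    """Probability that `candidate` is equivalent to (or improves over) `reference`
    for this question. Only meaningful once the head is fine-tuned on AE judgements."""
    first = f"question: {question} reference: {reference}"
    second = f"candidate: {candidate}"
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Example call (hypothetical question and answers):
print(bem_score("Who developed the theory of general relativity?",
                "Albert Einstein", "Einstein"))
```

Unlike token-level F1, such a classifier conditions on the question, and the asymmetric notion of equivalence ("equivalent to or better than the reference") can be encoded directly in the training labels.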
Related papers
- UniOQA: A Unified Framework for Knowledge Graph Question Answering with Large Language Models [4.627548680442906]
OwnThink is the most extensive Chinese open-domain knowledge graph introduced in recent times.
We introduce UniOQA, a unified framework that integrates two parallel approaches to question answering.
UniOQA notably advances SpCQL Logical Accuracy to 21.2% and Execution Accuracy to 54.9%, achieving the new state-of-the-art results on this benchmark.
arXiv Detail & Related papers (2024-06-04T08:36:39Z)
- Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE).
QPP-GenRE decomposes QPP into independent subtasks of predicting relevance of each item in a ranked list to a given query.
This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
- CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering [14.366087533102656]
Question answering (QA) can only make progress if we know whether an answer is correct.
Current evaluation metrics to determine answer equivalence (AE) often do not align with human judgments.
arXiv Detail & Related papers (2024-01-24T01:30:25Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Evaluation of Question Answering Systems: Complexity of judging a natural language [3.4771957347698583]
Question answering (QA) systems are among the most important and rapidly developing research topics in natural language processing (NLP).
This survey attempts to provide a systematic overview of the general framework of QA, QA paradigms, benchmark datasets, and assessment techniques for a quantitative evaluation of QA systems.
arXiv Detail & Related papers (2022-09-10T12:29:04Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA)
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.