AVA: an Automatic eValuation Approach to Question Answering Systems
- URL: http://arxiv.org/abs/2005.00705v1
- Date: Sat, 2 May 2020 05:00:16 GMT
- Title: AVA: an Automatic eValuation Approach to Question Answering Systems
- Authors: Thuy Vu and Alessandro Moschitti
- Abstract summary: AVA uses Transformer-based language models to encode question, answer, and reference text.
Our solutions achieve up to 74.7% in F1 score in predicting human judgement for single answers.
- Score: 123.36351076384479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce AVA, an automatic evaluation approach for Question Answering,
which, given a set of questions associated with Gold Standard answers, can
estimate system Accuracy. AVA uses Transformer-based language models to encode
question, answer, and reference text. This allows for effectively measuring the
similarity between the reference and an automatic answer, biased towards the
question semantics. To design, train and test AVA, we built multiple large
training, development, and test sets on both public and industrial benchmarks.
Our innovative solutions achieve up to 74.7% in F1 score in predicting human
judgement for single answers. Additionally, AVA can be used to evaluate the
overall system Accuracy with an RMSE, ranging from 0.02 to 0.09, depending on
the availability of multiple references.
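To make the approach concrete, below is a minimal sketch (not the authors' implementation) of a point-wise evaluator in the spirit of AVA: a Transformer cross-encoder scores a candidate answer against the reference, conditioned on the question, and per-answer decisions are aggregated into an overall Accuracy estimate. The checkpoint name, the input packing, the sigmoid activation, and the 0.5 threshold are all illustrative assumptions.

```python
# Hedged sketch of a point-wise answer evaluator in the spirit of AVA.
# Assumptions (not from the paper): the checkpoint, the way question,
# reference, and candidate are packed into the two input segments,
# the sigmoid activation, and the 0.5 decision threshold.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cross-encoder/stsb-roberta-base"  # stand-in similarity cross-encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def answer_score(question: str, reference: str, candidate: str) -> float:
    """Score a candidate answer against the reference, conditioned on the question."""
    # Question and reference share the first segment so that the
    # reference/candidate similarity is biased towards the question semantics.
    inputs = tokenizer(f"{question} {reference}", candidate,
                       return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logit = model(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item()  # map the single similarity logit to [0, 1]


def estimate_accuracy(triples, threshold=0.5):
    """Aggregate per-answer decisions over (question, reference, candidate) triples."""
    correct = sum(answer_score(q, ref, cand) >= threshold for q, ref, cand in triples)
    return correct / len(triples)
```

AVA itself is trained on large sets of such examples labeled with human judgements; the sketch only illustrates the input layout and the aggregation step whose error the reported RMSE (0.02 to 0.09) measures.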
Related papers
- Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation [69.81654421834989]
We introduce Auto, an agentic framework that automatically converts open-ended questions into multiple-choice format.
Using Auto, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format.
We evaluate 33 state-of-the-art vision language models on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
arXiv Detail & Related papers (2025-01-06T18:57:31Z)
- SQuArE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Multiple-Choice Question Generation: Towards an Automated Assessment Framework [0.0]
Transformer-based pretrained language models have demonstrated the ability to produce appropriate questions from a context paragraph.
We focus on a fully automated multiple-choice question generation (MCQG) system where both the question and possible answers must be generated from the context paragraph.
arXiv Detail & Related papers (2022-09-23T19:51:46Z)
- Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms: intelligently sampling responses to be scored by humans.
We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
arXiv Detail & Related papers (2021-11-17T05:00:51Z)
- Will this Question be Answered? Question Filtering via Answer Model Distillation for Efficient Question Answering [99.66470885217623]
We propose a novel approach towards improving the efficiency of Question Answering (QA) systems by filtering out questions that will not be answered by them.
This is based on an interesting new finding: the answer confidence scores of state-of-the-art QA systems can be approximated well by models solely using the input question text.
arXiv Detail & Related papers (2021-09-14T23:07:49Z)
- Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers [63.835172924290326]
We present a fast, scalable, and accurate approach towards automated Short Answer Scoring (SAS).
We propose and explain the design and development of a system for SAS, namely AutoSAS.
AutoSAS shows state-of-the-art performance, improving results by over 8% on some of the question prompts.
arXiv Detail & Related papers (2020-12-21T10:47:30Z)
- Unsupervised Evaluation for Question Answering with Transformers [46.16837670041594]
We investigate the hidden representations of questions, answers, and contexts in transformer-based QA architectures.
We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer is correct.
We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA.
arXiv Detail & Related papers (2020-10-07T07:03:30Z)
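The last entry above, on unsupervised evaluation, lends itself to a similar sketch: pool a QA model's hidden states over its predicted answer span and train a small probe to predict whether the answer is correct. Everything below (the checkpoint, the span pooling, and the logistic-regression probe) is an assumption for illustration, not the cited paper's exact method.

```python
# Hedged sketch: probing QA hidden states to predict answer correctness.
# The checkpoint, the pooling over the predicted span, and the probe are
# illustrative assumptions, not the method of the cited paper.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

MODEL_NAME = "deepset/roberta-base-squad2"  # any extractive QA checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME,
                                                      output_hidden_states=True)
model.eval()


def answer_representation(question: str, context: str) -> torch.Tensor:
    """Mean-pool the last hidden layer over the model's predicted answer span."""
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    start = out.start_logits.argmax(dim=-1).item()
    end = max(out.end_logits.argmax(dim=-1).item(), start)  # guard degenerate spans
    last_hidden = out.hidden_states[-1][0]          # (seq_len, hidden_dim)
    return last_hidden[start:end + 1].mean(dim=0)   # (hidden_dim,)


def train_probe(examples):
    """Fit a probe on (question, context, is_correct) triples labeled by humans."""
    feats = [answer_representation(q, c).numpy() for q, c, _ in examples]
    labels = [int(y) for _, _, y in examples]
    return LogisticRegression(max_iter=1000).fit(feats, labels)
```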
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.