AVA: an Automatic eValuation Approach to Question Answering Systems
- URL: http://arxiv.org/abs/2005.00705v1
- Date: Sat, 2 May 2020 05:00:16 GMT
- Title: AVA: an Automatic eValuation Approach to Question Answering Systems
- Authors: Thuy Vu and Alessandro Moschitti
- Abstract summary: AVA uses Transformer-based language models to encode question, answer, and reference text.
Our solutions achieve up to 74.7% in F1 score in predicting human judgement for single answers.
- Score: 123.36351076384479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce AVA, an automatic evaluation approach for Question Answering,
which, given a set of questions associated with Gold Standard answers, can
estimate system Accuracy. AVA uses Transformer-based language models to encode
question, answer, and reference text. This allows for effectively measuring the
similarity between the reference and an automatic answer, biased towards the
question semantics. To design, train and test AVA, we built multiple large
training, development, and test sets on both public and industrial benchmarks.
Our innovative solutions achieve up to 74.7% in F1 score in predicting human
judgement for single answers. Additionally, AVA can be used to evaluate the
overall system Accuracy with an RMSE, ranging from 0.02 to 0.09, depending on
the availability of multiple references.
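To make the approach concrete, below is a minimal sketch (not the authors' implementation) of a point-wise evaluator in the spirit of AVA: a Transformer cross-encoder scores a candidate answer against the reference, conditioned on the question, and per-answer decisions are aggregated into an overall Accuracy estimate. The checkpoint name, the input packing, the sigmoid activation, and the 0.5 threshold are all illustrative assumptions.

```python
# Hedged sketch of a point-wise answer evaluator in the spirit of AVA.
# Assumptions (not from the paper): the checkpoint, the way question,
# reference, and candidate are packed into the two input segments,
# the sigmoid activation, and the 0.5 decision threshold.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "cross-encoder/stsb-roberta-base"  # stand-in similarity cross-encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def answer_score(question: str, reference: str, candidate: str) -> float:
    """Score a candidate answer against the reference, conditioned on the question."""
    # Question and reference share the first segment so that the
    # reference/candidate similarity is biased towards the question semantics.
    inputs = tokenizer(f"{question} {reference}", candidate,
                       return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        logit = model(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item()  # map the single similarity logit to [0, 1]


def estimate_accuracy(triples, threshold=0.5):
    """Aggregate per-answer decisions over (question, reference, candidate) triples."""
    correct = sum(answer_score(q, ref, cand) >= threshold for q, ref, cand in triples)
    return correct / len(triples)
```

AVA itself is trained on large sets of such examples labeled with human judgements; the sketch only illustrates the input layout and the aggregation step whose error the reported RMSE (0.02 to 0.09) measures.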
Related papers
- Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation [69.81654421834989]
We introduce Auto, an agentic framework that automatically converts open-ended questions into multiple-choice format.
Using Auto, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format.
We evaluate 33 state-of-the-art vision language models on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
arXiv Detail & Related papers (2025-01-06T18:57:31Z)
- SQuArE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Multiple-Choice Question Generation: Towards an Automated Assessment Framework [0.0]
Transformer-based pretrained language models have demonstrated the ability to produce appropriate questions from a context paragraph.
We focus on a fully automated multiple-choice question generation (MCQG) system where both the question and possible answers must be generated from the context paragraph.
arXiv Detail & Related papers (2022-09-23T19:51:46Z)
- Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms: intelligently sampling responses to be scored by humans.
We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
arXiv Detail & Related papers (2021-11-17T05:00:51Z)
- Will this Question be Answered? Question Filtering via Answer Model Distillation for Efficient Question Answering [99.66470885217623]
We propose a novel approach towards improving the efficiency of Question Answering (QA) systems by filtering out questions that will not be answered by them.
This is based on an interesting new finding: the answer confidence scores of state-of-the-art QA systems can be approximated well by models solely using the input question text.
arXiv Detail & Related papers (2021-09-14T23:07:49Z)
- Get It Scored Using AutoSAS -- An Automated System for Scoring Short Answers [63.835172924290326]
We present a fast, scalable, and accurate approach towards automated Short Answer Scoring (SAS).
We propose and explain the design and development of a system for SAS, namely AutoSAS.
AutoSAS shows state-of-the-art performance, improving results by over 8% on some of the question prompts.
arXiv Detail & Related papers (2020-12-21T10:47:30Z)
- Unsupervised Evaluation for Question Answering with Transformers [46.16837670041594]
We investigate the hidden representations of questions, answers, and contexts in transformer-based QA architectures.
We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer is correct.
We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA.
arXiv Detail & Related papers (2020-10-07T07:03:30Z)
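The last entry above, on unsupervised evaluation, lends itself to a similar sketch: pool a QA model's hidden states over its predicted answer span and train a small probe to predict whether the answer is correct. Everything below (the checkpoint, the span pooling, and the logistic-regression probe) is an assumption for illustration, not the cited paper's exact method.

```python
# Hedged sketch: probing QA hidden states to predict answer correctness.
# The checkpoint, the pooling over the predicted span, and the probe are
# illustrative assumptions, not the method of the cited paper.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

MODEL_NAME = "deepset/roberta-base-squad2"  # any extractive QA checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME,
                                                      output_hidden_states=True)
model.eval()


def answer_representation(question: str, context: str) -> torch.Tensor:
    """Mean-pool the last hidden layer over the model's predicted answer span."""
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    start = out.start_logits.argmax(dim=-1).item()
    end = max(out.end_logits.argmax(dim=-1).item(), start)  # guard degenerate spans
    last_hidden = out.hidden_states[-1][0]          # (seq_len, hidden_dim)
    return last_hidden[start:end + 1].mean(dim=0)   # (hidden_dim,)


def train_probe(examples):
    """Fit a probe on (question, context, is_correct) triples labeled by humans."""
    feats = [answer_representation(q, c).numpy() for q, c, _ in examples]
    labels = [int(y) for _, _, y in examples]
    return LogisticRegression(max_iter=1000).fit(feats, labels)
```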
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.