Unsupervised Evaluation for Question Answering with Transformers
- URL: http://arxiv.org/abs/2010.03222v1
- Date: Wed, 7 Oct 2020 07:03:30 GMT
- Title: Unsupervised Evaluation for Question Answering with Transformers
- Authors: Lukas Muttenthaler, Isabelle Augenstein, Johannes Bjerva
- Abstract summary: We investigate the hidden representations of questions, answers, and contexts in transformer-based QA architectures.
We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer is correct.
We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA.
- Score: 46.16837670041594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is challenging to automatically evaluate the answer of a QA model at
inference time. Although many models provide confidence scores, and simple
heuristics can go a long way towards indicating answer correctness, such
measures are heavily dataset-dependent and are unlikely to generalize. In this
work, we begin by investigating the hidden representations of questions,
answers, and contexts in transformer-based QA architectures. We observe a
consistent pattern in the answer representations, which we show can be used to
automatically evaluate whether or not a predicted answer span is correct. Our
method does not require any labeled data and outperforms strong heuristic
baselines, across 2 datasets and 7 domains. We are able to predict whether or
not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7%
accuracy on SubjQA. We expect that this method will have broad applications,
e.g., in the semi-automatic development of QA datasets.
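The abstract does not spell out how the pattern in the answer representations is turned into a correctness signal, so the snippet below is only a minimal sketch of the general idea under stated assumptions: extract the hidden states of the predicted answer span from an off-the-shelf extractive QA model and score the span by its cosine similarity to the question representation. The checkpoint name, the mean pooling, and the similarity scoring are illustrative choices, not the procedure from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative checkpoint; any extractive (span-prediction) QA model would do.
MODEL = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def answer_score(question: str, context: str) -> float:
    """Unsupervised correctness signal: similarity between the predicted
    answer-span representation and the question representation."""
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    start = out.start_logits.argmax(-1).item()
    end = max(out.end_logits.argmax(-1).item(), start)  # guard against end < start
    hidden = out.hidden_states[-1][0]                    # (seq_len, hidden_dim)
    answer_vec = hidden[start:end + 1].mean(dim=0)       # mean-pooled answer span
    sep = inputs["input_ids"][0].tolist().index(tokenizer.sep_token_id)
    question_vec = hidden[1:sep].mean(dim=0)             # mean-pooled question tokens
    return torch.cosine_similarity(answer_vec, question_vec, dim=0).item()
```

A threshold on this score would then flag likely-incorrect answer spans without any labeled data; how that threshold is chosen is left open here.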
Related papers
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question given its relevant paragraph and the question-answer pairs from previous turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our model, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
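As a rough illustration of the filtering step described in the entry above, previous turns whose answers the model was unsure about are dropped before building the next input; the data layout and the 0.5 threshold are assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    question: str
    answer: str
    confidence: float  # confidence the ConvQA model assigned to this answer

def filter_history(history: List[Turn], threshold: float = 0.5) -> List[Turn]:
    """Keep only previous turns the model was confident about, so likely-wrong
    answers do not contaminate the next turn's input."""
    return [turn for turn in history if turn.confidence >= threshold]

def build_input(question: str, paragraph: str, history: List[Turn]) -> str:
    """Concatenate filtered history, current question, and paragraph into one
    input string (one common way to feed history to a ConvQA encoder)."""
    history_text = " ".join(f"{t.question} {t.answer}" for t in filter_history(history))
    return f"{history_text} {question} [SEP] {paragraph}".strip()
```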
- Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)
- Will this Question be Answered? Question Filtering via Answer Model Distillation for Efficient Question Answering [99.66470885217623]
We propose a novel approach towards improving the efficiency of Question Answering (QA) systems by filtering out questions that will not be answered by them.
This is based on an interesting new finding: the answer confidence scores of state-of-the-art QA systems can be approximated well by models solely using the input question text.
arXiv Detail & Related papers (2021-09-14T23:07:49Z)
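A minimal sketch of the finding described in the entry above, under the assumption that the QA system's confidence scores are distilled into a cheap question-only regressor; the TF-IDF features, ridge regression, and threshold are illustrative, not the paper's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Training data: questions paired with the confidence the full QA system
# assigned to its own answer (collected offline by running the QA system).
questions = ["Who wrote Hamlet?", "What is the airspeed of an unladen swallow?"]
qa_confidences = [0.93, 0.12]

# Question-only model that approximates the QA system's answer confidence.
filter_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
filter_model.fit(questions, qa_confidences)

def should_route_to_qa(question: str, threshold: float = 0.3) -> bool:
    """Skip the expensive QA system when the predicted confidence is low."""
    return float(filter_model.predict([question])[0]) >= threshold
```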
- Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning [10.742152224470317]
We propose a novel task for automated quality analysis and data cleaning: question-answer (QA) plausibility.
Given a machine- or user-generated question and a crowd-sourced response from a social media user, we determine whether the question and response are valid.
We evaluate the ability of our models to generate a clean, usable question-answer dataset.
arXiv Detail & Related papers (2020-11-10T04:11:44Z)
- Selective Question Answering under Domain Shift [90.021577320085]
Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely.
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
arXiv Detail & Related papers (2020-06-16T19:13:21Z)
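A minimal sketch of the calibrator idea from the entry above, assuming a held-out set where we know whether the QA model erred plus a few simple per-example features; the feature set and the random-forest classifier are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize(softmax_prob: float, question_len: int, answer_len: int) -> list:
    """Simple per-example features; the real feature set is an assumption here."""
    return [softmax_prob, question_len, answer_len]

# Held-out predictions: features plus whether the QA model's answer was wrong.
X_calib = np.array([featurize(0.91, 12, 3), featurize(0.42, 30, 8),
                    featurize(0.77, 9, 2), featurize(0.15, 25, 11)])
y_wrong = np.array([0, 1, 0, 1])  # 1 = the QA model erred on this example

calibrator = RandomForestClassifier(n_estimators=100).fit(X_calib, y_wrong)

def answer_or_abstain(features: list, risk_threshold: float = 0.5) -> str:
    """Abstain when the calibrator predicts an error is likely."""
    p_error = calibrator.predict_proba([features])[0, 1]
    return "abstain" if p_error > risk_threshold else "answer"
```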
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
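A minimal sketch of the template idea from the last entry, assuming a cloze-style template that swaps a chosen answer span for a generic question word; the paper's actual templates and answer-selection procedure may differ.

```python
def generate_qa_pair(retrieved_sentence: str, answer: str):
    """Turn a retrieved sentence into a (question, answer) training pair by
    replacing the answer span with a question-word template."""
    if answer not in retrieved_sentence:
        return None
    # Cloze-style template: swap the answer for a question word and append '?'.
    question = retrieved_sentence.replace(answer, "what", 1).rstrip(".") + "?"
    return question, answer

# Example inputs are illustrative; the retrieval step is omitted here.
pair = generate_qa_pair(
    "Marie Curie received the Nobel Prize in Physics in 1903.",
    "the Nobel Prize in Physics",
)
print(pair)
```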