Unsupervised Evaluation for Question Answering with Transformers
- URL: http://arxiv.org/abs/2010.03222v1
- Date: Wed, 7 Oct 2020 07:03:30 GMT
- Title: Unsupervised Evaluation for Question Answering with Transformers
- Authors: Lukas Muttenthaler, Isabelle Augenstein, Johannes Bjerva
- Abstract summary: We investigate the hidden representations of questions, answers, and contexts in transformer-based QA architectures.
We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer is correct.
We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA.
- Score: 46.16837670041594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is challenging to automatically evaluate the answer of a QA model at
inference time. Although many models provide confidence scores, and simple
heuristics can go a long way towards indicating answer correctness, such
measures are heavily dataset-dependent and are unlikely to generalize. In this
work, we begin by investigating the hidden representations of questions,
answers, and contexts in transformer-based QA architectures. We observe a
consistent pattern in the answer representations, which we show can be used to
automatically evaluate whether or not a predicted answer span is correct. Our
method does not require any labeled data and outperforms strong heuristic
baselines, across 2 datasets and 7 domains. We are able to predict whether or
not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7%
accuracy on SubjQA. We expect that this method will have broad applications,
e.g., in the semi-automatic development of QA datasets.
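The abstract does not spell out how the pattern in the answer representations is turned into a correctness signal, so the snippet below is only a minimal sketch of the general idea under stated assumptions: extract the hidden states of the predicted answer span from an off-the-shelf extractive QA model and score the span by its cosine similarity to the question representation. The checkpoint name, the mean pooling, and the similarity scoring are illustrative choices, not the procedure from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative checkpoint; any extractive (span-prediction) QA model would do.
MODEL = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def answer_score(question: str, context: str) -> float:
    """Unsupervised correctness signal: similarity between the predicted
    answer-span representation and the question representation."""
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    start = out.start_logits.argmax(-1).item()
    end = max(out.end_logits.argmax(-1).item(), start)  # guard against end < start
    hidden = out.hidden_states[-1][0]                    # (seq_len, hidden_dim)
    answer_vec = hidden[start:end + 1].mean(dim=0)       # mean-pooled answer span
    sep = inputs["input_ids"][0].tolist().index(tokenizer.sep_token_id)
    question_vec = hidden[1:sep].mean(dim=0)             # mean-pooled question tokens
    return torch.cosine_similarity(answer_vec, question_vec, dim=0).item()
```

A threshold on this score would then flag likely-incorrect answer spans without any labeled data; how that threshold is chosen is left open here.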
Related papers
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question given its relevant paragraph and the question-answer pairs from previous turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our model, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
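As a rough illustration of the filtering step described in the entry above, previous turns whose answers the model was unsure about are dropped before building the next input; the data layout and the 0.5 threshold are assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    question: str
    answer: str
    confidence: float  # confidence the ConvQA model assigned to this answer

def filter_history(history: List[Turn], threshold: float = 0.5) -> List[Turn]:
    """Keep only previous turns the model was confident about, so likely-wrong
    answers do not contaminate the next turn's input."""
    return [turn for turn in history if turn.confidence >= threshold]

def build_input(question: str, paragraph: str, history: List[Turn]) -> str:
    """Concatenate filtered history, current question, and paragraph into one
    input string (one common way to feed history to a ConvQA encoder)."""
    history_text = " ".join(f"{t.question} {t.answer}" for t in filter_history(history))
    return f"{history_text} {question} [SEP] {paragraph}".strip()
```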
- Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)
- Will this Question be Answered? Question Filtering via Answer Model Distillation for Efficient Question Answering [99.66470885217623]
We propose a novel approach towards improving the efficiency of Question Answering (QA) systems by filtering out questions that will not be answered by them.
This is based on an interesting new finding: the answer confidence scores of state-of-the-art QA systems can be approximated well by models solely using the input question text.
arXiv Detail & Related papers (2021-09-14T23:07:49Z)
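A minimal sketch of the finding described in the entry above, under the assumption that the QA system's confidence scores are distilled into a cheap question-only regressor; the TF-IDF features, ridge regression, and threshold are illustrative, not the paper's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Training data: questions paired with the confidence the full QA system
# assigned to its own answer (collected offline by running the QA system).
questions = ["Who wrote Hamlet?", "What is the airspeed of an unladen swallow?"]
qa_confidences = [0.93, 0.12]

# Question-only model that approximates the QA system's answer confidence.
filter_model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
filter_model.fit(questions, qa_confidences)

def should_route_to_qa(question: str, threshold: float = 0.3) -> bool:
    """Skip the expensive QA system when the predicted confidence is low."""
    return float(filter_model.predict([question])[0]) >= threshold
```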
- Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning [10.742152224470317]
We propose a novel task for automated quality analysis and data cleaning: question-answer (QA) plausibility.
Given a machine- or user-generated question and a crowd-sourced response from a social media user, we determine whether the question and response are valid.
We evaluate the ability of our models to generate a clean, usable question-answer dataset.
arXiv Detail & Related papers (2020-11-10T04:11:44Z)
- Selective Question Answering under Domain Shift [90.021577320085]
Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely.
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
arXiv Detail & Related papers (2020-06-16T19:13:21Z)
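A minimal sketch of the calibrator idea from the entry above, assuming a held-out set where we know whether the QA model erred plus a few simple per-example features; the feature set and the random-forest classifier are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize(softmax_prob: float, question_len: int, answer_len: int) -> list:
    """Simple per-example features; the real feature set is an assumption here."""
    return [softmax_prob, question_len, answer_len]

# Held-out predictions: features plus whether the QA model's answer was wrong.
X_calib = np.array([featurize(0.91, 12, 3), featurize(0.42, 30, 8),
                    featurize(0.77, 9, 2), featurize(0.15, 25, 11)])
y_wrong = np.array([0, 1, 0, 1])  # 1 = the QA model erred on this example

calibrator = RandomForestClassifier(n_estimators=100).fit(X_calib, y_wrong)

def answer_or_abstain(features: list, risk_threshold: float = 0.5) -> str:
    """Abstain when the calibrator predicts an error is likely."""
    p_error = calibrator.predict_proba([features])[0, 1]
    return "abstain" if p_error > risk_threshold else "answer"
```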
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
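A minimal sketch of the template idea from the last entry, assuming a cloze-style template that swaps a chosen answer span for a generic question word; the paper's actual templates and answer-selection procedure may differ.

```python
def generate_qa_pair(retrieved_sentence: str, answer: str):
    """Turn a retrieved sentence into a (question, answer) training pair by
    replacing the answer span with a question-word template."""
    if answer not in retrieved_sentence:
        return None
    # Cloze-style template: swap the answer for a question word and append '?'.
    question = retrieved_sentence.replace(answer, "what", 1).rstrip(".") + "?"
    return question, answer

# Example inputs are illustrative; the retrieval step is omitted here.
pair = generate_qa_pair(
    "Marie Curie received the Nobel Prize in Physics in 1903.",
    "the Nobel Prize in Physics",
)
print(pair)
```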