Selective Question Answering under Domain Shift
- URL: http://arxiv.org/abs/2006.09462v1
- Date: Tue, 16 Jun 2020 19:13:21 GMT
- Title: Selective Question Answering under Domain Shift
- Authors: Amita Kamath, Robin Jia, Percy Liang
- Abstract summary: Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely.
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
- Score: 90.021577320085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To avoid giving wrong answers, question answering (QA) models need to know
when to abstain from answering. Moreover, users often ask questions that
diverge from the model's training data, making errors more likely and thus
abstention more critical. In this work, we propose the setting of selective
question answering under domain shift, in which a QA model is tested on a
mixture of in-domain and out-of-domain data, and must answer (i.e., not abstain
on) as many questions as possible while maintaining high accuracy. Abstention
policies based solely on the model's softmax probabilities fare poorly, since
models are overconfident on out-of-domain inputs. Instead, we train a
calibrator to identify inputs on which the QA model errs, and abstain when it
predicts an error is likely. Crucially, the calibrator benefits from observing
the model's behavior on out-of-domain data, even if from a different domain
than the test data. We combine this method with a SQuAD-trained QA model and
evaluate on mixtures of SQuAD and five other QA datasets. Our method answers
56% of questions while maintaining 80% accuracy; in contrast, directly using
the model's probabilities only answers 48% at 80% accuracy.
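As a rough illustration of the approach above, the sketch below trains a simple calibrator to predict whether a frozen QA model's answer is correct, and abstains when the predicted probability of correctness falls below a threshold. The random-forest choice, the feature set, and the data-structure fields (question, context, top_probs, answer, gold_answer) are illustrative assumptions rather than the paper's exact implementation.

```python
"""Minimal sketch of a selective-QA calibrator: predict whether a frozen QA
model's answer is correct, and abstain when the predicted probability of
correctness is low. Features and field names are illustrative assumptions."""
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def featurize(example, qa_output):
    # Simple per-prediction features: confidence, margin, and input/output lengths.
    top = qa_output["top_probs"]                      # model's top softmax probabilities
    return [
        top[0],                                       # probability of the best span
        top[0] - top[1],                              # margin over the runner-up span
        len(example["question"].split()),
        len(example["context"].split()),
        len(qa_output["answer"].split()),
    ]


def train_calibrator(calib_data):
    # calib_data: (example, qa_output) pairs mixing in-domain data with some
    # out-of-domain data (not necessarily from the test domain), labeled by
    # whether the frozen QA model answered correctly.
    X = np.array([featurize(ex, out) for ex, out in calib_data])
    y = np.array([int(out["answer"] == ex["gold_answer"]) for ex, out in calib_data])
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)


def answer_or_abstain(calibrator, example, qa_output, threshold=0.5):
    # Answer only if the calibrator thinks the model is likely to be correct.
    p_correct = calibrator.predict_proba([featurize(example, qa_output)])[0, 1]
    return qa_output["answer"] if p_correct >= threshold else None   # None = abstain
```

Sweeping the threshold trades off coverage (the fraction of questions answered) against accuracy on the answered set, which is how figures like 56% coverage at 80% accuracy are reported.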
Related papers
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question using its relevant paragraph and the question-answer pairs from previous turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversational Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly [100.60560477391732]
We promote a problem formulation for reliable visual question answering (VQA).
We analyze both their coverage, the portion of questions answered, and risk, the error on that portion.
We find that although the best-performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain limits them to answering less than 8% of the questions to achieve a low risk of error (i.e., 1%).
This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage from, for example, 5.0% to 16.7% at 1% risk.
arXiv Detail & Related papers (2022-04-28T16:51:27Z)
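To make the coverage/risk trade-off described above concrete, the hedged sketch below computes the best achievable coverage at a fixed risk level from per-question confidence scores and correctness labels; the function and variable names are illustrative rather than taken from the paper.

```python
"""Coverage/risk sketch: coverage is the fraction of questions answered, risk is
the error rate on that answered fraction. Names are illustrative."""
import numpy as np


def coverage_at_risk(confidences, correct, max_risk=0.01):
    # Answer questions in order of decreasing confidence.
    order = np.argsort(-np.asarray(confidences, dtype=float))
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)

    answered = np.arange(1, n + 1)
    errors = np.cumsum(1.0 - correct)        # errors among the top-k answered
    risk = errors / answered                 # error rate on the answered set
    coverage = answered / n                  # fraction of questions answered

    ok = risk <= max_risk                    # operating points meeting the risk target
    return float(coverage[ok].max()) if ok.any() else 0.0
```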
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
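A standard post-hoc recipe in this calibration space is temperature scaling: fit a single temperature on held-out data so that softmax confidences better track the likelihood of correctness. The sketch below is a generic illustration under that assumption, not the paper's specific method; logits and labels are assumed held-out tensors of candidate-answer scores and gold indices.

```python
"""Generic temperature-scaling sketch (assumed setup, not the paper's recipe)."""
import torch


def fit_temperature(logits, labels, steps=200, lr=0.01):
    # logits: (n_examples, n_candidates) float tensor; labels: (n_examples,) long tensor.
    log_t = torch.zeros(1, requires_grad=True)        # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    nll = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nll(logits / log_t.exp(), labels)      # NLL of temperature-scaled softmax
        loss.backward()
        optimizer.step()
    return log_t.exp().item()                         # divide test-time logits by this T
```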
- Unsupervised Evaluation for Question Answering with Transformers [46.16837670041594]
We investigate the hidden representations of questions, answers, and contexts in transformer-based QA architectures.
We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer is correct.
We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD, and 80.7% accuracy on SubjQA.
arXiv Detail & Related papers (2020-10-07T07:03:30Z)
- Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem.
We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.