How Can We Know When Language Models Know? On the Calibration of
Language Models for Question Answering
- URL: http://arxiv.org/abs/2012.00955v2
- Date: Thu, 20 May 2021 09:05:03 GMT
- Title: How Can We Know When Language Models Know? On the Calibration of
Language Models for Question Answering
- Authors: Zhengbao Jiang, Jun Araki, Haibo Ding, Graham Neubig
- Abstract summary: We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
- Score: 80.82194311274694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown that language models (LMs) capture different types of
knowledge regarding facts or common sense. However, because no model is
perfect, they still fail to provide appropriate answers in many cases. In this
paper, we ask the question "how can we know when language models know, with
confidence, the answer to a particular query?" We examine this question from
the point of view of calibration, the property of a probabilistic model's
predicted probabilities actually being well correlated with the probabilities
of correctness. We examine three strong generative models -- T5, BART, and
GPT-2 -- and study whether their probabilities on QA tasks are well calibrated,
finding the answer is a relatively emphatic no. We then examine methods to
calibrate such models to make their confidence scores correlate better with the
likelihood of correctness through fine-tuning, post-hoc probability
modification, or adjustment of the predicted outputs or inputs. Experiments on
a diverse range of datasets demonstrate the effectiveness of our methods. We
also perform analysis to study the strengths and limitations of these methods,
shedding light on further improvements that may be made in methods for
calibrating LMs. We have released the code at
https://github.com/jzbjyb/lm-calibration.
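To make the calibration framing concrete, below is a minimal, hypothetical sketch (in Python, not the released lm-calibration code) of two ingredients the abstract mentions: measuring how well confidences track correctness via expected calibration error, and a post-hoc probability modification in the form of temperature scaling over a candidate answer set. The function names, the 10-bin ECE, and the fitting setup are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and take the coverage-weighted
    average of |empirical accuracy - mean confidence| per bin."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(corr[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def temperature_scale(answer_logprobs, temperature):
    """Post-hoc modification: rescale a candidate set's log-probabilities
    by a temperature fitted on held-out data, then renormalize."""
    scaled = np.asarray(answer_logprobs, dtype=float) / temperature
    scaled -= scaled.max()  # numerical stability before exponentiating
    probs = np.exp(scaled)
    return probs / probs.sum()
```

In this framing, the temperature (or any other post-hoc adjustment) would be fitted on held-out QA data so that the resulting confidences better track the empirical likelihood of correctness, and ECE computed before and after gives a simple check of the improvement.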
Related papers
- LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models [69.68379406317682]
We introduce a listener-aware finetuning method (LACIE) to calibrate implicit and explicit confidence markers.
We show that LACIE models the listener, considering not only whether an answer is right, but whether it will be accepted by a listener.
We find that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers.
arXiv Detail & Related papers (2024-05-31T17:16:38Z)
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question using its relevant paragraph together with the question-answer pairs from earlier turns of a multi-turn conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- Calibration Meets Explanation: A Simple and Effective Approach for Model Confidence Estimates [21.017890579840145]
We propose a method named CME that leverages model explanations to make the model less confident with non-inductive attributions.
We conduct extensive experiments on six datasets with two popular pre-trained language models.
Our findings highlight that model explanations can help calibrate posterior estimates.
arXiv Detail & Related papers (2022-11-06T06:17:21Z)
- A Close Look into the Calibration of Pre-trained Language Models [56.998539510508515]
Pre-trained language models (PLMs) may fail in giving reliable estimates of their predictive uncertainty.
We study the dynamic change in PLMs' calibration performance in training.
We extend two recently proposed learnable methods that directly collect data to train models to produce reasonable confidence estimates.
arXiv Detail & Related papers (2022-10-31T21:31:07Z)
- Language Models (Mostly) Know What They Know [10.836210010868932]
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly.
We investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer.
arXiv Detail & Related papers (2022-07-11T22:59:39Z)
- Teaching Models to Express Their Uncertainty in Words [6.356472059420951]
We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language.
This is the first time a model has been shown to express calibrated uncertainty about its own answers in natural language.
arXiv Detail & Related papers (2022-05-28T05:02:31Z)
- Selective Question Answering under Domain Shift [90.021577320085]
Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely.
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
arXiv Detail & Related papers (2020-06-16T19:13:21Z)
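The last entry above, on selective question answering, trains a calibrator and abstains whenever an error looks likely; a minimal, hypothetical sketch of the resulting threshold policy (on held-out data, pick the lowest confidence threshold whose answered subset still meets a target accuracy) might look like the following. The threshold-selection rule and the 80% target are illustrative assumptions rather than that paper's exact procedure.

```python
import numpy as np

def pick_threshold(dev_confidence, dev_correct, target_accuracy=0.80):
    """Find the lowest confidence threshold whose answered subset
    still reaches the target accuracy on held-out data."""
    conf = np.asarray(dev_confidence, dtype=float)
    corr = np.asarray(dev_correct, dtype=float)
    order = np.argsort(-conf)                              # most confident first
    running_acc = np.cumsum(corr[order]) / np.arange(1, conf.size + 1)
    meets = np.where(running_acc >= target_accuracy)[0]
    if meets.size == 0:
        return np.inf                                      # never accurate enough: abstain on everything
    k = meets.max()                                        # largest answered prefix meeting the target
    return conf[order][k]

def answer_or_abstain(confidence, threshold):
    """Abstention policy: answer only when the calibrated confidence clears the threshold."""
    return confidence >= threshold
```

Coverage is then the fraction of test questions whose confidence clears the threshold; the 56% vs. 48% figures quoted in that entry describe this kind of coverage-at-fixed-accuracy comparison.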