A Close Look into the Calibration of Pre-trained Language Models
- URL: http://arxiv.org/abs/2211.00151v3
- Date: Mon, 8 May 2023 05:22:46 GMT
- Title: A Close Look into the Calibration of Pre-trained Language Models
- Authors: Yangyi Chen, Lifan Yuan, Ganqu Cui, Zhiyuan Liu, Heng Ji
- Abstract summary: Pre-trained language models (PLMs) may fail in giving reliable estimates of their predictive uncertainty.
We study the dynamic change in PLMs' calibration performance in training.
We extend two recently proposed learnable methods that directly collect data to train models to have reasonable confidence estimations.
- Score: 56.998539510508515
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pre-trained language models (PLMs) may fail in giving reliable estimates of
their predictive uncertainty. We take a close look into this problem, aiming to
answer two questions: (1) Do PLMs learn to become calibrated in the training
process? (2) How effective are existing calibration methods? For the first
question, we conduct fine-grained control experiments to study the dynamic
change in PLMs' calibration performance in training. We consider six factors as
control variables, including dataset difficulty, available training samples,
training steps, the number of tunable parameters, model scale, and pretraining.
We observe a consistent change in calibration performance across six factors.
We find that PLMs don't learn to become calibrated in training, evidenced by
the continual increase in confidence, no matter whether the predictions are
correct or not. We highlight that our finding somewhat contradicts two
established conclusions: (a) Larger PLMs are more calibrated; (b) Pretraining
improves model calibration. Next, we study the effectiveness of existing
calibration methods in mitigating the overconfidence issue. Besides unlearnable
calibration methods (e.g., label smoothing), we adapt and extend two recently
proposed learnable methods that directly collect data to train models to have
reasonable confidence estimations. Experimental results show that learnable
methods significantly reduce PLMs' confidence in wrong predictions. The code is
available at https://github.com/lifan-yuan/PLMCalibration.
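The paper's experiments revolve around quantifying miscalibration, for which expected calibration error (ECE) is the standard metric in this line of work. Below is a minimal, illustrative ECE computation; the bin count, variable names, and toy data are assumptions for exposition, not taken from the released code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: bin predictions by confidence and compare
    each bin's average confidence with its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Toy example of overconfidence: high confidence, mediocre accuracy.
conf = np.array([0.95, 0.90, 0.92, 0.88, 0.97])
hit = np.array([1, 0, 1, 0, 1])
print(expected_calibration_error(conf, hit))
```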
Related papers
- Feature Clipping for Uncertainty Calibration [24.465567005078135]
Modern deep neural networks (DNNs) often suffer from overconfidence, leading to miscalibration.
We propose a novel post-hoc calibration method called feature clipping (FC) to address this issue.
FC involves clipping feature values to a specified threshold, effectively increasing entropy for samples with high calibration error.
arXiv Detail & Related papers (2024-10-16T06:44:35Z)
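The feature-clipping recipe summarized above is simple to illustrate: cap penultimate-layer feature magnitudes at a threshold before the classifier head, which shrinks the logits and softens overconfident predictions. A minimal sketch, assuming a plain linear head and an arbitrary clip threshold (both illustrative, not the paper's released implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def clipped_prediction(features, weights, bias, clip=1.0):
    """Clip feature values into [-clip, clip] before applying the linear head."""
    clipped = np.clip(features, -clip, clip)
    return softmax(weights @ clipped + bias)

rng = np.random.default_rng(0)
feats = rng.normal(scale=3.0, size=16)        # large-magnitude features
W, b = rng.normal(size=(4, 16)), np.zeros(4)
print(softmax(W @ feats + b).max())           # typically near-1, overconfident
print(clipped_prediction(feats, W, b).max())  # softer distribution after clipping
```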
- Calibrating Language Models with Adaptive Temperature Scaling [58.056023173579625]
We introduce Adaptive Temperature Scaling (ATS), a post-hoc calibration method that predicts a temperature scaling parameter for each token prediction.
ATS improves calibration by over 10-50% across three downstream natural language evaluation benchmarks compared to prior calibration methods.
arXiv Detail & Related papers (2024-09-29T22:54:31Z)
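ATS generalizes classic post-hoc temperature scaling, which divides logits by a single scalar T fitted on held-out data, to a per-token predicted temperature. The sketch below shows only the basic scalar variant with a toy grid search; the fitting procedure, data, and values are illustrative assumptions, not the ATS method itself.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of the gold labels under temperature T."""
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature that minimizes held-out negative log-likelihood."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# Toy held-out set: 3-way logits and gold labels.
logits = np.array([[4.0, 0.5, 0.1], [3.5, 3.0, 0.2], [0.1, 4.2, 0.3]])
labels = np.array([0, 1, 1])
T = fit_temperature(logits, labels)
print(T, softmax(logits / T).max(axis=-1))
```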
- Does Alignment Tuning Really Break LLMs' Internal Confidence? [5.893124686141782]
Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration.
This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods.
arXiv Detail & Related papers (2024-08-31T05:12:36Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
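One natural consistency measure is plain agreement frequency: sample several generations for the same input, take the majority answer, and report the fraction of samples that agree with it as the confidence. A minimal sketch under that assumption (the paper's three consistency metrics may be defined differently):

```python
from collections import Counter

def consistency_confidence(samples):
    """Return the majority answer and the fraction of sampled generations agreeing with it."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Five sampled answers to the same question from a language model (illustrative).
samples = ["Paris", "Paris", "Lyon", "Paris", "Paris"]
print(consistency_confidence(samples))  # ('Paris', 0.8)
```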
- Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems.
Previous work shows that introducing an extra calibration task can mitigate the miscalibration issue.
We propose a training algorithm LM-TOAST to tackle the challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
- Bag of Tricks for In-Distribution Calibration of Pretrained Transformers [8.876196316390493]
We present an empirical study on confidence calibration for pre-trained language models (PLMs).
We find that the ensemble model overfitted to the training set shows sub-par calibration performance.
We propose the Calibrated PLM (CALL), a combination of calibration techniques.
arXiv Detail & Related papers (2023-02-13T21:11:52Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Uncertainty Quantification and Deep Ensembles [79.4957965474334]
We show that deep ensembles do not necessarily lead to improved calibration properties.
We show that standard ensembling methods, when used in conjunction with modern techniques such as mixup regularization, can lead to less calibrated models.
This paper examines the interplay between three of the simplest and most commonly used approaches to leveraging deep learning when data is scarce.
arXiv Detail & Related papers (2020-07-17T07:32:24Z)
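For context, a deep ensemble forms its prediction by averaging the softmax outputs of independently trained members; the finding above is that this averaging by itself does not guarantee better calibration, particularly when combined with mixup. A minimal averaging sketch with made-up member outputs:

```python
import numpy as np

def ensemble_predict(member_probs):
    """Average per-member class probabilities (shape: members x classes)."""
    return np.mean(member_probs, axis=0)

# Three ensemble members' probabilities for one example (illustrative values).
members = np.array([
    [0.90, 0.07, 0.03],
    [0.70, 0.20, 0.10],
    [0.95, 0.03, 0.02],
])
avg = ensemble_predict(members)
print(avg, avg.max())  # the averaged confidence is lower than the most confident member's
```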