A Study on the Calibration of In-context Learning
- URL: http://arxiv.org/abs/2312.04021v4
- Date: Thu, 28 Mar 2024 03:01:45 GMT
- Title: A Study on the Calibration of In-context Learning
- Authors: Hanlin Zhang, Yi-Fan Zhang, Yaodong Yu, Dhruv Madeka, Dean Foster, Eric Xing, Himabindu Lakkaraju, Sham Kakade,
- Abstract summary: We study in-context learning (ICL), a prevalent method for adapting static language models through tailored prompts.
We observe that, with an increasing number of ICL examples, models initially exhibit increased miscalibration before achieving better calibration.
We explore recalibration techniques and find that a scaling-binning calibrator can reduce calibration errors consistently.
- Score: 27.533223818505682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate uncertainty quantification is crucial for the safe deployment of machine learning models, and prior research has demonstrated improvements in the calibration of modern language models (LMs). We study in-context learning (ICL), a prevalent method for adapting static LMs through tailored prompts, and examine the balance between performance and calibration across a broad spectrum of natural language understanding and reasoning tasks. Through comprehensive experiments, we observe that, with an increasing number of ICL examples, models initially exhibit increased miscalibration before achieving better calibration and miscalibration tends to arise in low-shot settings. Moreover, we find that methods aimed at improving usability, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations. Furthermore, we explore recalibration techniques and find that a scaling-binning calibrator can reduce calibration errors consistently.
Related papers
- Reassessing How to Compare and Improve the Calibration of Machine Learning Models [7.183341902583164]
A machine learning model is calibrated if its predicted probability for an outcome matches the observed frequency for that outcome conditional on the model prediction.
We show that there exist trivial recalibration approaches that can appear seemingly state-of-the-art unless calibration and prediction metrics are accompanied by additional generalization metrics.
arXiv Detail & Related papers (2024-06-06T13:33:45Z) - Probabilistic Calibration by Design for Neural Network Regression [2.3020018305241337]
We introduce a novel end-to-end model training procedure called Quantile Recalibration Training.
We demonstrate the performance of our method in a large-scale experiment involving 57 regression datasets.
arXiv Detail & Related papers (2024-03-18T17:04:33Z) - Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z) - On Task Performance and Model Calibration with Supervised and
Self-Ensembled In-Context Learning [71.44986275228747]
In-context learning (ICL) has become an efficient approach propelled by the recent advancements in large language models (LLMs)
However, both paradigms are prone to suffer from the critical problem of overconfidence (i.e., miscalibration)
arXiv Detail & Related papers (2023-12-21T11:55:10Z) - On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z) - Calibration by Distribution Matching: Trainable Kernel Calibration
Metrics [56.629245030893685]
We introduce kernel-based calibration metrics that unify and generalize popular forms of calibration for both classification and regression.
These metrics admit differentiable sample estimates, making it easy to incorporate a calibration objective into empirical risk minimization.
We provide intuitive mechanisms to tailor calibration metrics to a decision task, and enforce accurate loss estimation and no regret decisions.
arXiv Detail & Related papers (2023-10-31T06:19:40Z) - On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness have impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z) - Variable-Based Calibration for Machine Learning Classifiers [11.9995808096481]
We introduce the notion of variable-based calibration to characterize calibration properties of a model.
We find that models with near-perfect expected calibration error can exhibit significant miscalibration as a function of features of the data.
arXiv Detail & Related papers (2022-09-30T00:49:31Z) - Modular Conformal Calibration [80.33410096908872]
We introduce a versatile class of algorithms for recalibration in regression.
This framework allows one to transform any regression model into a calibrated probabilistic model.
We conduct an empirical study of MCC on 17 regression datasets.
arXiv Detail & Related papers (2022-06-23T03:25:23Z) - Meta-Calibration: Learning of Model Calibration Using Differentiable
Expected Calibration Error [46.12703434199988]
We introduce a new differentiable surrogate for expected calibration error (DECE) that allows calibration quality to be directly optimised.
We also propose a meta-learning framework that uses DECE to optimise for validation set calibration.
arXiv Detail & Related papers (2021-06-17T15:47:50Z) - On Calibration of Scene-Text Recognition Models [16.181357648680365]
We analyze several recent STR methods and show that they are consistently overconfident.
We demonstrate that for attention based decoders, calibration of individual character predictions increases word-level calibration error.
arXiv Detail & Related papers (2020-12-23T13:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.