Calibrating Large Language Models with Sample Consistency
- URL: http://arxiv.org/abs/2402.13904v1
- Date: Wed, 21 Feb 2024 16:15:20 GMT
- Title: Calibrating Large Language Models with Sample Consistency
- Authors: Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar,
Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch
- Abstract summary: We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
- Score: 76.23956851098598
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Accurately gauging the confidence level of Large Language Models' (LLMs)
predictions is pivotal for their reliable application. However, LLMs are often
uncalibrated inherently and elude conventional calibration techniques due to
their proprietary nature and massive scale. In this work, we explore the
potential of deriving confidence from the distribution of multiple randomly
sampled model generations, via three measures of consistency. We perform an
extensive evaluation across various open and closed-source models on nine
reasoning datasets. Results show that consistency-based calibration methods
outperform existing post-hoc approaches. Meanwhile, we find that factors such
as intermediate explanations, model scaling, and larger sample sizes enhance
calibration, while instruction-tuning makes calibration more difficult.
Moreover, confidence scores obtained from consistency have the potential to
enhance model performance. Finally, we offer practical guidance on choosing
suitable consistency metrics for calibration, tailored to the characteristics
of various LMs.
Related papers
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead.
We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z) - Few-Shot Recalibration of Language Models [23.829795148520834]
We train a recalibration model that takes in a few unlabeled examples from any given slice and predicts a curve that remaps confidence scores to be more accurate for that slice.
Our trained model can recalibrate for arbitrary new slices, without using any labeled data from that slice.
Experiments show that our few-shot recalibrator consistently outperforms existing calibration methods.
arXiv Detail & Related papers (2024-03-27T06:25:40Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to there is a discrepancy between the predicted confidence and performance.
We introduce Dynamic Regularization (DReg) which aims to learn what should be learned during training thereby circumventing the confidence adjusting trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Calibrating Long-form Generations from Large Language Models [37.2496541665881]
Large Language Models' (LLMs) confidence scores should align with the actual likelihood of its responses being correct.
Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness.
We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
arXiv Detail & Related papers (2024-02-09T17:00:32Z) - On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z) - Calibration in Deep Learning: A Survey of the State-of-the-Art [7.6087138685470945]
Calibrating deep neural models plays an important role in building reliable, robust AI systems in safety-critical applications.
Recent work has shown that modern neural networks that possess high predictive capability are poorly calibrated and produce unreliable model predictions.
arXiv Detail & Related papers (2023-08-02T15:28:10Z) - Calibrating Multimodal Learning [94.65232214643436]
We propose a novel regularization technique, i.e., Calibrating Multimodal Learning (CML) regularization, to calibrate the predictive confidence of previous methods.
This technique could be flexibly equipped by existing models and improve the performance in terms of confidence calibration, classification accuracy, and model robustness.
arXiv Detail & Related papers (2023-06-02T04:29:57Z) - On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness have impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z) - Calibrate: Interactive Analysis of Probabilistic Model Output [5.444048397001003]
We present Calibrate, an interactive reliability diagram that is resistant to drawbacks in traditional approaches.
We demonstrate the utility of Calibrate through use cases on both real-world and synthetic data.
arXiv Detail & Related papers (2022-07-27T20:01:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.