Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
- URL: http://arxiv.org/abs/2502.11028v1
- Date: Sun, 16 Feb 2025 07:46:09 GMT
- Title: Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models
- Authors: Prateek Chhikara,
- Abstract summary: This paper examines how model size, mitigateors, and question types affect confidence alignment.
We introduce an evaluation framework to measure overconfidence and investigate whether multiple-choice formats or worsen miscalibration.
- Score: 0.6091702876917281
- License:
- Abstract: Large Language Models (LLMs) demonstrate impressive performance across diverse tasks, yet confidence calibration remains a challenge. Miscalibration - where models are overconfident or underconfident - poses risks, particularly in high-stakes applications. This paper presents an empirical study on LLM calibration, examining how model size, distractors, and question types affect confidence alignment. We introduce an evaluation framework to measure overconfidence and investigate whether multiple-choice formats mitigate or worsen miscalibration. Our findings show that while larger models (e.g., GPT-4o) are better calibrated overall, they are more prone to distraction, whereas smaller models benefit more from answer choices but struggle with uncertainty estimation. Unlike prior work, which primarily reports miscalibration trends, we provide actionable insights into failure modes and conditions that worsen overconfidence. These findings highlight the need for calibration-aware interventions and improved uncertainty estimation methods.
Related papers
- The Reliability Paradox: Exploring How Shortcut Learning Undermines Language Model Calibration [5.616884466478886]
Pre-trained language models (PLMs) have enabled significant performance gains in the field of natural language processing.
Recent studies have found PLMs to suffer from miscalibration, indicating a lack of accuracy in the confidence estimates provided by these models.
This paper investigates whether lower calibration error implies reliable decision rules for a language model.
arXiv Detail & Related papers (2024-12-17T08:04:28Z) - Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration [20.049443396032423]
Black-box large language models (LLMs) are increasingly deployed in various environments.
LLMs often exhibit overconfidence, leading to potential risks and misjudgments.
We propose a novel method, textitAtypical presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates.
arXiv Detail & Related papers (2024-09-05T03:45:35Z) - Revisiting Confidence Estimation: Towards Reliable Failure Prediction [53.79160907725975]
We find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors.
We propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance.
arXiv Detail & Related papers (2024-03-05T11:44:14Z) - Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z) - Multi-Perspective Consistency Enhances Confidence Estimation in Large
Language Models [27.63938857490995]
This work focuses on improving the confidence estimation of large language models.
Considering the fragility of self-awareness in language models, we introduce a Multi-Perspective Consistency (MPC) method.
The experimental results on eight publicly available datasets show that our MPC achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-02-17T13:37:39Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to there is a discrepancy between the predicted confidence and performance.
We introduce Dynamic Regularization (DReg) which aims to learn what should be learned during training thereby circumventing the confidence adjusting trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z) - Improving the Reliability of Large Language Models by Leveraging
Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination"
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z) - Two Sides of Miscalibration: Identifying Over and Under-Confidence
Prediction for Network Calibration [1.192436948211501]
Proper confidence calibration of deep neural networks is essential for reliable predictions in safety-critical tasks.
Miscalibration can lead to model over-confidence and/or under-confidence.
We introduce a novel metric, a miscalibration score, to identify the overall and class-wise calibration status.
We use the class-wise miscalibration score as a proxy to design a calibration technique that can tackle both over and under-confidence.
arXiv Detail & Related papers (2023-08-06T17:59:14Z) - On Calibrating Semantic Segmentation Models: Analyses and An Algorithm [51.85289816613351]
We study the problem of semantic segmentation calibration.
Model capacity, crop size, multi-scale testing, and prediction correctness have impact on calibration.
We propose a simple, unifying, and effective approach, namely selective scaling.
arXiv Detail & Related papers (2022-12-22T22:05:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.