LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval
- URL: http://arxiv.org/abs/2510.26995v1
- Date: Thu, 30 Oct 2025 20:49:41 GMT
- Title: LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval
- Authors: Elliot L. Epstein, John Winnicki, Thanawat Sornwanee, Rajat Dwaraknath,
- Abstract summary: Large language models (LLMs) excel at numerical estimation but struggle to correctly quantify uncertainty.<n>We study how well LLMs construct confidence intervals around their own answers and find that they are systematically overconfident.
- Score: 0.3499870393443268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) excel at numerical estimation but struggle to correctly quantify uncertainty. We study how well LLMs construct confidence intervals around their own answers and find that they are systematically overconfident. To evaluate this behavior, we introduce FermiEval, a benchmark of Fermi-style estimation questions with a rigorous scoring rule for confidence interval coverage and sharpness. Across several modern models, nominal 99\% intervals cover the true answer only 65\% of the time on average. With a conformal prediction based approach that adjusts the intervals, we obtain accurate 99\% observed coverage, and the Winkler interval score decreases by 54\%. We also propose direct log-probability elicitation and quantile adjustment methods, which further reduce overconfidence at high confidence levels. Finally, we develop a perception-tunnel theory explaining why LLMs exhibit overconfidence: when reasoning under uncertainty, they act as if sampling from a truncated region of their inferred distribution, neglecting its tails.
Related papers
- BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents [58.05949210993854]
We investigate whether search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions.<n>We propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encourage the model to try again until reaching a satisfactory confidence level.
arXiv Detail & Related papers (2025-10-27T15:58:51Z) - ConfTuner: Training Large Language Models to Express Their Confidence Verbally [58.63318088243125]
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare.<n>LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence"
arXiv Detail & Related papers (2025-08-26T09:25:32Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions.<n>We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation.<n>Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - Deterministic Object Pose Confidence Region Estimation [13.545295537964337]
6D pose confidence region estimation has emerged as a critical direction for uncertainty quantification.<n>Current sampling-based approach suffers from critical limitations that severely impede their practical deployment.<n>We propose a deterministic and efficient method for estimating pose confidence regions.
arXiv Detail & Related papers (2025-06-28T02:03:34Z) - Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence [16.311538811237536]
Large language models (LLMs) are increasingly used for factual question-answering.<n>For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence.<n>We propose a simple procedure, uncertainty distillation, to teach an LLM to calibrated semantic confidences.
arXiv Detail & Related papers (2025-03-18T21:29:29Z) - On Verbalized Confidence Scores for LLMs [25.160810008907397]
Uncertainty quantification for large language models (LLMs) can establish more human trust into their responses.<n>This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens.<n>We assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods.
arXiv Detail & Related papers (2024-12-19T11:10:36Z) - Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators [6.403926452181712]
Large Language Models (LLMs) tend to be unreliable in the factuality of their answers.
We present a survey and empirical comparison of estimators of factual confidence.
Our experiments indicate that trained hidden-state probes provide the most reliable confidence estimates.
arXiv Detail & Related papers (2024-06-19T10:11:37Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Reconfidencing LLMs from the Grouping Loss Perspective [56.801251926946485]
Large Language Models (LLMs) are susceptible to generating hallucinated answers in a confident tone.
Recent findings show that controlling uncertainty must go beyond calibration.
We construct a new evaluation dataset derived from a knowledge base to assess confidence scores given to answers of Mistral and LLaMA.
arXiv Detail & Related papers (2024-02-07T15:40:22Z) - Learning Confidence for Transformer-based Neural Machine Translation [38.679505127679846]
We propose an unsupervised confidence estimate learning jointly with the training of the neural machine translation (NMT) model.
We explain confidence as how many hints the NMT model needs to make a correct prediction, and more hints indicate low confidence.
We demonstrate that our learned confidence estimate achieves high accuracy on extensive sentence/word-level quality estimation tasks.
arXiv Detail & Related papers (2022-03-22T01:51:58Z) - CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.