Revisiting Uncertainty Estimation and Calibration of Large Language Models
- URL: http://arxiv.org/abs/2505.23854v1
- Date: Thu, 29 May 2025 02:04:49 GMT
- Title: Revisiting Uncertainty Estimation and Calibration of Large Language Models
- Authors: Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, Chang Xu
- Abstract summary: We present the most comprehensive study to date of uncertainty estimation in large language models (LLMs). We focus on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU). Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable.
- Score: 28.493449764136518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) are increasingly deployed in high-stakes applications, robust uncertainty estimation is essential for ensuring the safe and trustworthy deployment of LLMs. We present the most comprehensive study to date of uncertainty estimation in LLMs, evaluating 80 models spanning open- and closed-source families, dense and Mixture-of-Experts (MoE) architectures, reasoning and non-reasoning modes, quantization variants and parameter scales from 0.6B to 671B. Focusing on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU), we systematically evaluate uncertainty calibration and selective classification using the challenging MMLU-Pro benchmark, which covers both reasoning-intensive and knowledge-based tasks. Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable. We also find that high accuracy does not imply reliable uncertainty, and that model scale, post-training, reasoning ability and quantization all influence estimation performance. Notably, LLMs exhibit better uncertainty estimates on reasoning tasks than on knowledge-heavy ones, and good calibration does not necessarily translate to effective error ranking. These findings highlight the need for multi-perspective evaluation and position LVU as a practical tool for improving the reliability of LLMs in real-world settings.
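To make the evaluation protocol concrete, here is a minimal sketch (not the authors' released code) of the two measurements the abstract refers to: expected calibration error for calibration, and a confidence-vs-correctness AUROC for selective classification, computed from per-question confidences and 0/1 correctness labels.

```python
# Minimal sketch (not the paper's code): evaluating an uncertainty estimate the way the
# abstract describes -- calibration (ECE) and selective classification (AUROC of
# correctness vs. confidence) -- from confidences in [0, 1] and 0/1 correctness labels.
import numpy as np

def expected_calibration_error(conf, correct, n_bins=15):
    """Equal-width binning ECE: coverage-weighted gap between mean confidence and accuracy per bin."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)  # bin index of each confidence
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def selective_auroc(conf, correct):
    """Probability that a correct answer receives higher confidence than an incorrect one."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, bool)
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    # Mann-Whitney U statistic, counting ties as 0.5.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical token-probability-based (TPU-style) confidences for four questions.
conf = [0.92, 0.55, 0.81, 0.40]
correct = [1, 0, 1, 0]
print(expected_calibration_error(conf, correct), selective_auroc(conf, correct))
```

The abstract's observation that good calibration does not necessarily translate to effective error ranking corresponds to the fact that a low ECE does not force a high AUROC: the two metrics can disagree on the same set of confidences.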
Related papers
- An Information-Theoretic Perspective on Multi-LLM Uncertainty Estimation [7.018119896897734]
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. We propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naive ensemble baselines.
arXiv Detail & Related papers (2025-07-09T19:13:25Z)
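For context on the entry above, the sketch below illustrates (under assumed binary predictions and a hypothetical divergence threshold) how pairwise Jensen-Shannon divergence could be used to keep only mutually consistent models before averaging; it is not the MUSE reference implementation.

```python
# Illustrative sketch only (not MUSE itself): Jensen-Shannon divergence between binary
# predictive distributions, used here to select a low-divergence subset of models whose
# probabilities are then averaged.
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """JSD between two binary distributions, each given as P(y=1)."""
    p = np.array([1 - p, p]) + eps
    q = np.array([1 - q, q]) + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def aggregate_low_divergence(probs, threshold=0.06):
    """Average the models whose mean pairwise JSD to the others falls below a (hypothetical) threshold."""
    probs = np.asarray(probs, float)
    n = len(probs)
    mean_jsd = np.array([
        np.mean([jensen_shannon(probs[i], probs[j]) for j in range(n) if j != i])
        for i in range(n)
    ])
    subset = probs[mean_jsd <= threshold] if (mean_jsd <= threshold).any() else probs
    return subset.mean()

# The disagreeing model at 0.20 is excluded; prints the average of the other three (0.71).
print(aggregate_low_divergence([0.71, 0.68, 0.74, 0.20]))
```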
- Towards Harmonized Uncertainty Estimation for Large Language Models [22.58034272573749]
It is essential to quantify the reliability of LLM generations through uncertainty estimation. We propose CUE (Corrector for Uncertainty Estimation): a straightforward yet effective method that employs a lightweight model, trained on data aligned with the target LLM's performance, to adjust uncertainty scores.
arXiv Detail & Related papers (2025-05-25T10:17:57Z)
- Token-Level Uncertainty Estimation for Large Language Model Reasoning [24.56760223952017]
Large Language Models (LLMs) have demonstrated impressive capabilities, but their output quality remains inconsistent across various application scenarios. We propose a token-level uncertainty estimation framework to enable LLMs to self-assess and self-improve their generation quality in mathematical reasoning.
arXiv Detail & Related papers (2025-05-16T22:47:32Z)
- Enhancing Trust in Large Language Models with Uncertainty-Aware Fine-Tuning [10.457661605916435]
Large language models (LLMs) have revolutionized the field of natural language processing with their impressive reasoning and question-answering capabilities. LLMs are sometimes prone to generating credible-sounding but incorrect information, a phenomenon known as hallucination. We introduce a novel uncertainty-aware causal language modeling loss function, grounded in the principles of decision theory.
arXiv Detail & Related papers (2024-12-03T23:14:47Z)
- ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees [68.33498595506941]
We introduce a novel uncertainty measure based on self-consistency theory.
We then develop a conformal uncertainty criterion by integrating an uncertainty condition aligned with correctness into the conformal prediction (CP) algorithm.
Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2024-06-29T17:33:07Z)
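The entry above builds on conformal prediction (CP). As a rough illustration of the CP machinery it plugs into, the sketch below runs standard split conformal prediction with an assumed nonconformity score (1 minus a self-consistency frequency); ConU's actual uncertainty measure and criterion differ in the details.

```python
# Generic split conformal prediction sketch (the standard CP recipe, not ConU's exact
# method): calibrate a nonconformity threshold on held-out questions, then include any
# candidate answer whose nonconformity stays below it, bounding miscoverage at roughly alpha.
import math

def conformal_threshold(calib_scores, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(calib_scores)
    k = math.ceil((n + 1) * (1 - alpha))       # rank of the conformal quantile
    return sorted(calib_scores)[min(k, n) - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose nonconformity score is within the threshold."""
    return [ans for ans, s in candidate_scores.items() if s <= threshold]

# Nonconformity here is assumed to be 1 minus a self-consistency frequency (how often an
# answer appears among sampled generations); this particular choice is illustrative.
calib = [0.15, 0.30, 0.05, 0.45, 0.25, 0.60, 0.10, 0.35, 0.20, 0.50]
tau = conformal_threshold(calib, alpha=0.2)
print(prediction_set({"A": 0.10, "B": 0.40, "C": 0.75}, tau))  # -> ['A', 'B']
```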
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities [79.9629927171974]
Uncertainty quantification in Large Language Models (LLMs) is crucial for applications where safety and reliability are important.
We propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs.
arXiv Detail & Related papers (2024-05-30T12:42:05Z)
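As a rough sketch of the idea behind the entry above, the snippet below computes the von Neumann entropy of a unit-trace kernel built from an assumed pairwise semantic-similarity matrix over sampled answers; the paper's actual kernel construction is more elaborate.

```python
# Minimal sketch of the general idea (not the paper's exact kernel): given pairwise
# semantic similarities between sampled answers, form a unit-trace kernel and take its
# von Neumann entropy, -sum(lambda_i * log(lambda_i)) over its eigenvalues.
import numpy as np

def von_neumann_entropy(similarity, eps=1e-12):
    """Entropy of the unit-trace kernel built from a symmetric similarity matrix."""
    K = np.asarray(similarity, float)
    K = 0.5 * (K + K.T)                  # enforce symmetry
    K = K / np.trace(K)                  # unit trace, so eigenvalues act like probabilities
    eigvals = np.clip(np.linalg.eigvalsh(K), eps, None)
    return float(-np.sum(eigvals * np.log(eigvals)))

# Assumed similarities (e.g., from an NLI model) among three sampled answers.
semantically_consistent = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]
divergent = [[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(von_neumann_entropy(semantically_consistent))  # low uncertainty
print(von_neumann_entropy(divergent))                # higher uncertainty
```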
- Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach [6.209293868095268]
We study the problem of uncertainty estimation and calibration for LLMs.
We propose a supervised approach that leverages labeled datasets to estimate the uncertainty in LLMs' responses.
Our method is easy to implement and adaptable to different levels of model accessibility, including black-box, grey-box, and white-box settings.
arXiv Detail & Related papers (2024-04-24T17:10:35Z)
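The entry above is representative of supervised uncertainty estimators. The sketch below is a generic, hypothetical stand-in (not the paper's features or model): a logistic regression fit on labeled correctness whose predicted probability is reused as a confidence score.

```python
# Hedged sketch of a generic supervised confidence estimator (the features and model are
# assumptions, not the paper's): fit a classifier on labeled (features, correct?) pairs so
# its predicted probability serves as a confidence score for new responses.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical black-box features per response:
# [verbalized confidence, answer length / 100, agreement rate across resampled generations]
X_train = np.array([
    [0.95, 0.3, 1.0],
    [0.60, 0.8, 0.4],
    [0.90, 0.2, 0.8],
    [0.40, 0.9, 0.2],
    [0.85, 0.4, 0.9],
    [0.50, 0.7, 0.3],
])
y_train = np.array([1, 0, 1, 0, 1, 0])   # 1 = response was correct

clf = LogisticRegression().fit(X_train, y_train)

# Confidence for a new response, given its own feature vector.
x_new = np.array([[0.70, 0.5, 0.6]])
print(clf.predict_proba(x_new)[0, 1])    # estimated probability the response is correct
```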
- Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations [63.330182403615886]
A major barrier to the practical deployment of large language models (LLMs) is their lack of reliability.
Three situations where this is particularly apparent are correctness, hallucinations when given unanswerable questions, and safety.
In all three cases, models should ideally abstain from responding, much like humans, whose awareness of their own uncertainty makes them refrain from answering questions they do not know the answer to.
arXiv Detail & Related papers (2024-04-16T23:56:38Z)
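The abstention idea in the entry above reduces, in its simplest form, to thresholding a confidence score; the sketch below shows that reduction with a hypothetical threshold and answers, not the paper's method.

```python
# Simple sketch of threshold-based abstention (an illustration of the idea, not the
# paper's procedure): answer only when confidence clears a threshold, otherwise abstain,
# trading coverage for accuracy on the answered subset.
def answer_or_abstain(answer, confidence, threshold=0.75):
    return answer if confidence >= threshold else "I don't know."

# Hypothetical (answer, confidence) pairs.
responses = [("Paris", 0.97), ("42 kg", 0.35), ("1969", 0.88)]
for ans, conf in responses:
    print(answer_or_abstain(ans, conf))   # -> "Paris", "I don't know.", "1969"
```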
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
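To illustrate the kind of decomposition the entry above describes, the sketch below applies the standard total = aleatoric + epistemic entropy decomposition to predictions obtained from different clarifications of one ambiguous input; the inputs and numbers are assumed for illustration and this is not the paper's reference code.

```python
# Hedged sketch of the standard entropy decomposition applied to an ensemble of
# predictions, imagined here as coming from different clarifications of one ambiguous
# input: total entropy = aleatoric (mean per-clarification entropy) + epistemic (the rest).
import numpy as np

def entropy(p, eps=1e-12):
    p = np.clip(np.asarray(p, float), eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def decompose(predictions):
    """predictions: one probability vector per clarification of the same input."""
    predictions = np.asarray(predictions, float)
    total = entropy(predictions.mean(axis=0))           # entropy of the averaged prediction
    aleatoric = float(np.mean([entropy(p) for p in predictions]))
    return total, aleatoric, total - aleatoric          # last term: epistemic part

# Two clarifications of an ambiguous question yield confident but conflicting answers,
# so most of the uncertainty shows up as epistemic.
print(decompose([[0.95, 0.05], [0.05, 0.95]]))
```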
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.