Related papers: Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models

URL: http://arxiv.org/abs/2510.20460v1
Date: Thu, 23 Oct 2025 11:50:47 GMT
Title: Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
Authors: Christian Hobelsberger, Theresa Winner, Andreas Nawroth, Oliver Mitevski, Anna-Carolina Haensch
Abstract summary: We evaluate four approaches for confidence estimation in large language models (LLMs). We conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall.
Score: 1.8374839804848957
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) produce outputs with varying levels of uncertainty and, just as often, varying levels of correctness, making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
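
Illustration: the abstract names two evaluation axes, calibration and discrimination of correct answers. The minimal Python sketch below shows one common way each could be measured from per-answer confidence scores and 0/1 correctness labels. The choice of expected calibration error (ECE) and AUROC, the bin count, and the function names are assumptions made for illustration; this listing does not state which metrics the paper actually reports.

```python
from typing import Sequence


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    Assumes confidences lie in [0, 1]; n_bins is an illustrative default.
    """
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer receives higher confidence than an
    incorrect one (ties count as 0.5). Quadratic time; fine for small samples."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four answers with confidence scores and correctness labels.
scores = [0.95, 0.85, 0.35, 0.15]
labels = [1, 1, 0, 0]
print(expected_calibration_error(scores, labels), auroc(scores, labels))
```

A lower ECE indicates that stated confidence tracks accuracy more closely, while a higher AUROC indicates that the score separates correct from incorrect answers more cleanly. A method can do well on one axis and poorly on the other, which is one reason to report both.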

Related papers

Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations [49.84786015324238]
Confidence estimation (CE) indicates how reliable the answers of large language models (LLMs) are, and can impact user trust and decision-making. We present a comprehensive evaluation framework for CE methods that measures confidence quality on three new aspects. These include robustness of confidence against prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers.
arXiv Detail & Related papers (2026-01-12T23:16:50Z)
Towards Reliable LLM-based Robot Planning via Combined Uncertainty Estimation [68.106428321492]
Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. We present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes the uncertainty into epistemic and intrinsic uncertainty, each estimated separately.
arXiv Detail & Related papers (2025-10-09T10:26:58Z)
Towards Harmonized Uncertainty Estimation for Large Language Models [22.58034272573749]
It is essential to quantify the reliability of LLM generations through uncertainty estimation. We propose CUE (Corrector for Uncertainty Estimation), a straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM's performance to adjust uncertainty scores.
arXiv Detail & Related papers (2025-05-25T10:17:57Z)
MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels [16.300463494913593]
Large Language Models (LLMs) require robust confidence estimation. MCQA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
arXiv Detail & Related papers (2025-02-20T05:09:29Z)
Rethinking LLM Uncertainty: A Multi-Agent Approach to Estimating Black-Box Model Uncertainty [47.95943057892318]
Quantifying uncertainty in black-box LLMs is vital for reliable responses and scalable oversight. We introduce DiverseAgentEntropy, a novel, theoretically-grounded method employing multi-agent interaction for uncertainty estimation.
arXiv Detail & Related papers (2024-12-12T18:52:40Z)
SAUP: Situation Awareness Uncertainty Propagation on LLM Agent [52.444674213316574]
Large language models (LLMs) integrated into multistep agent systems enable complex decision-making processes across various applications. Existing uncertainty estimation methods primarily focus on final-step outputs, which fail to account for cumulative uncertainty over the multistep decision-making process and the dynamic interactions between agents and their environments. We propose SAUP, a novel framework that propagates uncertainty through each step of an LLM-based agent's reasoning process.
arXiv Detail & Related papers (2024-12-02T01:31:13Z)
Fact-Level Confidence Calibration and Self-Correction [64.40105513819272]
We propose a Fact-Level framework that calibrates confidence to relevance-weighted correctness at the fact level. We also develop Confidence-Guided Fact-level Self-Correction (ConFix), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones.
arXiv Detail & Related papers (2024-11-20T14:15:18Z)
Black-box Uncertainty Quantification Method for LLM-as-a-Judge [13.45579129351493]
We introduce a novel method for quantifying uncertainty designed to enhance the trustworthiness of LLM-as-a-Judge evaluations. The method quantifies uncertainty by analyzing the relationships between generated assessments and possible ratings. By cross-evaluating these relationships and constructing a confusion matrix based on token probabilities, the method derives labels of high or low uncertainty.
arXiv Detail & Related papers (2024-10-15T13:29:22Z)
ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees [68.33498595506941]
We introduce a novel uncertainty measure based on self-consistency theory. We then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the conformal prediction (CP) algorithm. Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2024-06-29T17:33:07Z)
Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach [6.209293868095268]
We study the problem of uncertainty estimation and calibration for LLMs. We propose a supervised approach that leverages labeled datasets to estimate the uncertainty in LLMs' responses. Our method is easy to implement and adaptable to different levels of model accessibility including black box, grey box, and white box.
arXiv Detail & Related papers (2024-04-24T17:10:35Z)
Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection [90.71323430635593]
We propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers. Building upon this paradigm, we introduce a two-step framework, which first instructs the LLM to reflect and provide justifications for each candidate answer. This framework can be seamlessly integrated with existing approaches for superior self-detection.
arXiv Detail & Related papers (2024-03-15T02:38:26Z)
A Survey of Confidence Estimation and Calibration in Large Language Models [86.692994151323]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, they can be unreliable due to factual errors in their generations. Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations.
arXiv Detail & Related papers (2023-11-14T16:43:29Z)
