Related papers: On Calibration of Large Language Models: From Response To Capability

On Calibration of Large Language Models: From Response To Capability

URL: http://arxiv.org/abs/2602.13540v1
Date: Sat, 14 Feb 2026 01:07:45 GMT
Title: On Calibration of Large Language Models: From Response To Capability
Authors: Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, Shao-Hua Sun,
Abstract summary: Large language models (LLMs) are widely deployed as general-purpose problem solvers.<n>We introduce capability calibration, which targets the model's expected accuracy on a query.<n>Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
Score: 66.59139960234326
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.

Related papers

Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns [67.24756301536617]
We propose a Generalized Correctness Model (GCM) to generate accurate and calibrated confidence estimates.<n>We first show that GCMs can be trained on the correctness data from many LLMs.<n>We then use CMs as a lens for studying the source of correctness prediction ability and its generalization.
arXiv Detail & Related papers (2025-09-29T16:19:01Z)
Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding [48.92310906093414]
We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs)<n>We leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models.<n>We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA.
arXiv Detail & Related papers (2025-04-30T19:19:21Z)
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models [0.6091702876917281]
Large Language Models (LLMs) show remarkable proficiency in natural language tasks.<n>Overconfidence-misalignment between predicted confidence and true correctness poses significant risks in critical decision-making applications.<n>We present a comprehensive analysis on calibration in LLMs across nine LLMs and three factual Question-Answering datasets.
arXiv Detail & Related papers (2025-02-16T07:46:09Z)
Fact-Level Confidence Calibration and Self-Correction [64.40105513819272]
We propose a Fact-Level framework that calibrates confidence to relevance-weighted correctness at the fact level. We also develop Confidence-Guided Fact-level Self-Correction ($textbfConFix$), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones.
arXiv Detail & Related papers (2024-11-20T14:15:18Z)
Graph-based Confidence Calibration for Large Language Models [22.394717844099684]
We propose using an auxiliary learning model to assess response correctness based on the self-consistency of multiple outputs generated by the large language models.<n>Our method builds a consistency graph to represent the agreement among multiple responses and uses a graph neural network (GNN) to estimate the likelihood that each response is correct.
arXiv Detail & Related papers (2024-11-03T20:36:44Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. Results show that consistency-based calibration methods outperform existing post-hoc approaches. We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
Calibrating Long-form Generations from Large Language Models [34.72041258464477]
Large Language Models' (LLMs) confidence scores should align with the actual likelihood of its responses being correct. Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness. We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
arXiv Detail & Related papers (2024-02-09T17:00:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.