Graph-based Confidence Calibration for Large Language Models
- URL: http://arxiv.org/abs/2411.02454v2
- Date: Thu, 22 May 2025 02:24:27 GMT
- Title: Graph-based Confidence Calibration for Large Language Models
- Authors: Yukun Li, Sijia Wang, Lifu Huang, Li-Ping Liu
- Abstract summary: We propose using an auxiliary learning model to assess response correctness based on the self-consistency of multiple outputs generated by the large language model. Our method builds a consistency graph to represent the agreement among multiple responses and uses a graph neural network (GNN) to estimate the likelihood that each response is correct.
- Score: 22.394717844099684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable confidence estimation is essential for enhancing the trustworthiness of large language models (LLMs), especially in high-stakes scenarios. Despite its importance, accurately estimating confidence in LLM responses remains a significant challenge. In this work, we propose using an auxiliary learning model to assess response correctness based on the self-consistency of multiple outputs generated by the LLM. Our method builds a consistency graph to represent the agreement among multiple responses and uses a graph neural network (GNN) to estimate the likelihood that each response is correct. Experiments demonstrate that this method has strong calibration performance on various benchmark datasets and generalizes well to out-of-domain cases.
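A minimal sketch of the pipeline the abstract describes, in two steps: build a consistency graph over sampled responses, then score each node with a small GCN. The pairwise `similarity` function, the node `features` (e.g., response embeddings), the two-layer architecture, and the weights `w1`/`w2` are illustrative assumptions, not details from the paper; in practice the weights would be trained against correctness labels with a binary cross-entropy loss.

```python
import numpy as np

def build_consistency_graph(responses, similarity, threshold=0.8):
    """Connect two sampled responses when their pairwise similarity
    clears the threshold; returns an n x n adjacency matrix."""
    n = len(responses)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(responses[i], responses[j]) >= threshold:
                adj[i, j] = adj[j, i] = 1.0
    return adj

def gcn_confidence(adj, features, w1, w2):
    """Two-layer GCN that scores each response node; the sigmoid output
    is read as the probability that the response is correct."""
    a = adj + np.eye(adj.shape[0])              # add self-loops
    d = 1.0 / np.sqrt(a.sum(axis=1))
    a_hat = a * d[:, None] * d[None, :]         # symmetric normalization
    h = np.maximum(a_hat @ features @ w1, 0.0)  # ReLU hidden layer
    logits = (a_hat @ h @ w2).ravel()           # one score per node
    return 1.0 / (1.0 + np.exp(-logits))
```

Passing messages over the agreement structure is what lets a response's confidence depend on how many, and which, other samples it agrees with, rather than on raw vote counts alone.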
Related papers
- LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration [0.0]
Large Language Models (LLMs) have demonstrated impressive performance across various tasks. We propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations.
arXiv Detail & Related papers (2025-07-31T00:35:45Z)
- Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation [26.580361841501514]
Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration.
This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information.
We propose a novel Confidence through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for object-centric queries.
arXiv Detail & Related papers (2025-04-21T04:01:22Z)
- Rewarding Doubt: A Reinforcement Learning Approach to Confidence Calibration of Large Language Models [34.59785123314865]
A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers.
We introduce a novel Reinforcement Learning (RL) approach for LLM calibration that fine-tunes LLMs to elicit calibrated confidence estimations in their answers to factual questions.
arXiv Detail & Related papers (2025-03-04T13:48:50Z)
- Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models. We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
- Efficient Test-Time Scaling via Self-Calibration [18.32718448734639]
Best-of-N sampling and Self-Consistency with majority voting are simple and effective, but require a fixed number of sampled responses for each query.
This could result in wasted computation for simpler questions and insufficient exploration for more challenging ones.
We argue that the model's confidence in its responses can be used to improve the efficiency of test-time scaling.
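A toy sketch of that efficiency argument, using the majority answer's share of samples as a stand-in confidence signal (the paper trains a self-calibrated confidence instead); `generate`, `threshold`, `max_n`, and `min_n` are illustrative assumptions:

```python
from collections import Counter

def adaptive_best_of_n(generate, threshold=0.9, max_n=16, min_n=3):
    """Keep sampling answers until the modal answer's share of samples
    clears the confidence threshold, then stop early."""
    answers = []
    for _ in range(max_n):
        answers.append(generate())
        if len(answers) >= min_n:
            _, freq = Counter(answers).most_common(1)[0]
            if freq / len(answers) >= threshold:
                break
    return Counter(answers).most_common(1)[0][0]
```

Easy questions converge after a few samples; hard ones get the full budget, which is exactly the waste-versus-exploration trade-off the summary points at.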
arXiv Detail & Related papers (2025-02-25T00:21:14Z)
- CER: Confidence Enhanced Reasoning in LLMs [2.4392539322920763]
We introduce an uncertainty-aware framework designed to enhance the accuracy of Large Language Model responses. We quantify the confidence of intermediate answers, such as numerical results in mathematical reasoning and proper nouns in open-domain generation. Results consistently validate the effectiveness of our novel confidence aggregation method.
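How per-step confidences get combined is the crux here. A geometric mean is one simple aggregation rule, shown purely as an assumption; the paper's own rule may differ:

```python
import numpy as np

def aggregate_step_confidences(step_confidences):
    """Combine per-step confidences (numbers in (0, 1]) into a single
    path-level score via the geometric mean."""
    c = np.clip(np.asarray(step_confidences, dtype=float), 1e-9, 1.0)
    return float(np.exp(np.log(c).mean()))
```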
arXiv Detail & Related papers (2025-02-20T15:16:42Z)
- Confidence Diagram of Nonparametric Ranking for Uncertainty Assessment in Large Language Models Evaluation [20.022623972491733]
Ranking large language models (LLMs) has proven to be an effective tool to improve alignment based on the best-of-$N$ policy. We propose a new inferential framework for hypothesis testing on rankings of language models.
arXiv Detail & Related papers (2024-12-07T02:34:30Z)
- Fact-Level Confidence Calibration and Self-Correction [64.40105513819272]
We propose a Fact-Level framework that calibrates confidence to relevance-weighted correctness at the fact level.
We also develop Confidence-Guided Fact-level Self-Correction (ConFix), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones.
arXiv Detail & Related papers (2024-11-20T14:15:18Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
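A sketch of the per-sample smoothing idea. The linear mapping from uncertainty to a smoothing value, the cap `max_eps`, and the name `uncertainty_smoothed_targets` are assumptions for illustration:

```python
import numpy as np

def uncertainty_smoothed_targets(labels, num_classes, uncertainty, max_eps=0.3):
    """Per-sample label smoothing: higher-uncertainty samples get softer
    targets. `uncertainty` holds values in [0, 1], one per sample."""
    eps = max_eps * np.asarray(uncertainty, dtype=float)[:, None]     # (n, 1)
    targets = np.ones((len(labels), num_classes)) * (eps / num_classes)
    targets[np.arange(len(labels)), labels] += 1.0 - eps[:, 0]
    return targets  # each row sums to 1
```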
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Multicalibration for Confidence Scoring in LLMs [6.948522445499497]
This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs).
We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation".
We show how our techniques can yield confidence scores that provide substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.
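One simplified piece of the idea, assuming `groups` comes from, say, k-means over prompt/completion embeddings: within each group, patch raw scores toward the group's empirical accuracy. Full multicalibration iterates such patches jointly over groups and score buckets; this single pass is only a sketch.

```python
import numpy as np

def patch_group_scores(scores, correct, groups):
    """Replace each group's raw confidence scores with that group's
    empirical accuracy (one simplified calibration patch)."""
    scores = np.asarray(scores, dtype=float).copy()
    correct = np.asarray(correct, dtype=float)
    groups = np.asarray(groups)
    for g in np.unique(groups):
        mask = groups == g
        scores[mask] = correct[mask].mean()
    return scores
```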
arXiv Detail & Related papers (2024-04-06T17:33:37Z)
- Calibrating Large Language Models Using Their Generations Only [44.26441565763495]
APRICOT is a method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone.
It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages.
We show that our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question answering, detecting incorrect LLM answers.
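APRICOT fine-tunes an auxiliary model on the LLM's input/output text alone; the stand-in below swaps in TF-IDF features and logistic regression to keep the sketch self-contained. The `[SEP]` joining and function names are assumptions, not the paper's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def fit_confidence_model(questions, answers, is_correct):
    """Train a text-only correctness predictor; its predicted probability
    serves as the confidence score for a (question, answer) pair."""
    texts = [q + " [SEP] " + a for q, a in zip(questions, answers)]
    vec = TfidfVectorizer().fit(texts)
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), is_correct)
    return lambda q, a: clf.predict_proba(vec.transform([q + " [SEP] " + a]))[0, 1]
```

Because the predictor never touches the target model's logits, the same recipe works for black-box LLMs served over an API.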
arXiv Detail & Related papers (2024-03-09T17:46:24Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
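Of the consistency measures studied, plain agreement is the simplest: the modal answer's share of all samples, read as the model's confidence in that answer. A sketch (the other measures are not shown):

```python
from collections import Counter

def agreement_confidence(sampled_answers):
    """Return the modal answer and its fraction of all samples."""
    answer, freq = Counter(sampled_answers).most_common(1)[0]
    return answer, freq / len(sampled_answers)
```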
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between predicted confidence and actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z)
- Calibrating Long-form Generations from Large Language Models [34.72041258464477]
Large Language Models' (LLMs) confidence scores should align with the actual likelihood of their responses being correct.
Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness.
We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
arXiv Detail & Related papers (2024-02-09T17:00:32Z)
- Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z)
- Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination".
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
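The rejection half of that framework reduces to a simple reject option; the threshold `tau` and the fixed abstention string below are illustrative assumptions, and the enhancement half (revising outputs via in-context learning) is not shown:

```python
def answer_or_abstain(answer, confidence, tau=0.7):
    """Selective answering: emit the answer only when the model's
    uncertainty-derived confidence clears the threshold tau."""
    return answer if confidence >= tau else "I don't know."
```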
arXiv Detail & Related papers (2023-10-07T12:06:53Z)
- Trustworthy Multimodal Regression with Mixture of Normal-inverse Gamma Distributions [91.63716984911278]
We introduce a novel Mixture of Normal-Inverse Gamma distributions (MoNIG) algorithm, which efficiently estimates uncertainty in a principled way for adaptive integration of different modalities and produces trustworthy regression results.
Experimental results on both synthetic and different real-world data demonstrate the effectiveness and trustworthiness of our method on various multimodal regression tasks.
arXiv Detail & Related papers (2021-11-11T14:28:12Z)
- Fast Adaptively Weighted Matrix Factorization for Recommendation with Implicit Feedback [28.30678887024847]
How to assign confidence weights and how to handle the large volume of unobserved data are two key problems for implicit recommendation models.
We propose a fast adaptively weighted matrix factorization (FAWMF) method based on a variational auto-encoder.
Experiments on real-world datasets demonstrate the superiority of the proposed FAWMF and its learning algorithm fBGD.
arXiv Detail & Related papers (2020-03-04T04:50:44Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
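A sketch of the confidence-weighted prototype update, where `confidences` stands in for the meta-learned per-query weights; the blending factor `alpha` is an assumption:

```python
import numpy as np

def refine_prototype(prototype, query_embeddings, confidences, alpha=0.5):
    """Transductive update: blend a class prototype with the
    confidence-weighted mean of unlabeled query embeddings."""
    w = np.asarray(confidences, dtype=float)[:, None]
    weighted_mean = (w * query_embeddings).sum(axis=0) / w.sum()
    return (1.0 - alpha) * prototype + alpha * weighted_mean
```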
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.