Calibrating Long-form Generations from Large Language Models
- URL: http://arxiv.org/abs/2402.06544v1
- Date: Fri, 9 Feb 2024 17:00:32 GMT
- Title: Calibrating Long-form Generations from Large Language Models
- Authors: Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra
- Abstract summary: Large Language Models' (LLMs) confidence scores should align with the actual likelihood of their responses being correct.
Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness.
We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
- Score: 37.2496541665881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To enhance Large Language Models' (LLMs) reliability, calibration is
essential -- the model's assessed confidence scores should align with the
actual likelihood of its responses being correct. However, current confidence
elicitation methods and calibration metrics typically rely on a binary
true/false assessment of response correctness. This approach does not apply to
long-form generation, where an answer can be partially correct. Addressing this
gap, we introduce a unified calibration framework, in which both the
correctness of the LLMs' responses and their associated confidence levels are
treated as distributions across a range of scores. Within this framework, we
develop three metrics to precisely evaluate LLM calibration and further propose
two confidence elicitation methods based on self-consistency and
self-evaluation. Our experiments, which include long-form QA and summarization
tasks, demonstrate that larger models do not necessarily guarantee better
calibration, that calibration performance is metric-dependent, and that
self-consistency methods excel on factoid datasets. We also find that
calibration can be enhanced through techniques such as fine-tuning, integrating
relevant source documents, scaling the temperature, and combining
self-consistency with self-evaluation. Lastly, we showcase a practical
application of our system: selecting and cascading open-source models and
ChatGPT to optimize correctness given a limited API budget. This research not
only challenges existing notions of LLM calibration but also offers practical
methodologies for improving trustworthiness in long-form generation.
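
To make the two elicitation methods concrete, here is a minimal Python sketch. The helpers `generate` (an LLM sampling call) and `similarity` (a response-agreement scorer such as ROUGE or NLI entailment) are hypothetical stand-ins, and the prompts, sample counts, and 0-100 rating scale are illustrative assumptions rather than the paper's exact setup.

```python
from typing import Callable, List

def self_consistency_confidence(
    prompt: str,
    generate: Callable[[str, float], str],    # hypothetical LLM sampling call
    similarity: Callable[[str, str], float],  # hypothetical agreement scorer in [0, 1]
    n_samples: int = 5,
    temperature: float = 1.0,
) -> float:
    """Confidence = mean pairwise agreement among sampled long-form answers."""
    samples: List[str] = [generate(prompt, temperature) for _ in range(n_samples)]
    scores = [
        similarity(samples[i], samples[j])
        for i in range(n_samples)
        for j in range(i + 1, n_samples)
    ]
    return sum(scores) / len(scores)

def self_evaluation_confidence(
    prompt: str,
    answer: str,
    generate: Callable[[str, float], str],
) -> float:
    """Confidence = a score the model assigns to its own answer (illustrative
    0-100 rating prompt; assumes the model replies with a bare number)."""
    eval_prompt = (
        f"Question: {prompt}\nProposed answer: {answer}\n"
        "On a scale of 0 to 100, how correct is this answer? Reply with a number."
    )
    reply = generate(eval_prompt, 0.0)  # greedy decoding for the self-judgment
    return float(reply.strip()) / 100.0
```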
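One plausible instantiation of a calibration metric under this framework is a binned expected calibration error in which per-response correctness is a continuous score in [0, 1] (e.g., graded factuality or ROUGE) rather than a 0/1 label. This is a sketch for illustration, not necessarily one of the paper's three metrics.

```python
import numpy as np

def continuous_ece(confidences, correctness, n_bins: int = 10) -> float:
    """Expected calibration error with continuous correctness scores in [0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bins - 1:
            # make the last bin right-inclusive so confidence 1.0 is counted
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if not mask.any():
            continue
        # weight each bin by its share of samples; gap is |mean conf - mean correctness|
        gap = abs(confidences[mask].mean() - correctness[mask].mean())
        ece += mask.mean() * gap
    return ece
```

For example, `continuous_ece([0.9, 0.4], [0.8, 0.5])` yields 0.1: each response contributes half the weight and a gap of 0.1 in its bin.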
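The model-cascading application can likewise be sketched: answer with a cheap open-source model first and escalate to a paid API only when elicited confidence is low and budget remains. The fixed threshold and budget bookkeeping below are illustrative assumptions, not the paper's exact selection strategy.

```python
from typing import Callable, Optional

def cascade_answer(
    prompt: str,
    cheap_model: Callable[[str], str],       # e.g., an open-source LLM call
    expensive_model: Callable[[str], str],   # e.g., a paid API such as ChatGPT
    confidence_fn: Callable[[str, str], float],  # elicited confidence in [0, 1]
    threshold: float = 0.7,                  # illustrative routing threshold
    budget: Optional[dict] = None,           # e.g., {"calls": 100} remaining paid calls
) -> str:
    """Route to the expensive model only on low-confidence cheap answers."""
    answer = cheap_model(prompt)
    if confidence_fn(prompt, answer) < threshold and budget and budget.get("calls", 0) > 0:
        budget["calls"] -= 1  # spend one unit of the paid-API budget
        answer = expensive_model(prompt)
    return answer
```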
Related papers
- Multicalibration for Confidence Scoring in LLMs [6.948522445499497]
This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs).
We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation".
We show how our techniques can yield confidence scores that provide substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.
arXiv Detail & Related papers (2024-04-06T17:33:37Z)
- Self-Consistency Boosts Calibration for Math Reasoning [69.82896431282927]
We design three off-the-shelf calibration methods based on self-consistency for math reasoning tasks.
Our methods bridge model confidence and accuracy better than existing methods based on p(True) or logits.
arXiv Detail & Related papers (2024-03-14T20:17:10Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
- Modular Conformal Calibration [80.33410096908872]
We introduce a versatile class of algorithms for recalibration in regression.
This framework allows one to transform any regression model into a calibrated probabilistic model.
We conduct an empirical study of MCC on 17 regression datasets.
arXiv Detail & Related papers (2022-06-23T03:25:23Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric, the localized calibration error (LCE), that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that improves LCE more than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
- Calibrating Structured Output Predictors for Natural Language Processing [8.361023354729731]
We propose a general calibration scheme for output entities of interest in neural-network-based structured prediction models.
Our proposed method can be used with any binary class calibration scheme and a neural network model.
We show that our method outperforms current calibration techniques for named entity recognition, part-of-speech tagging, and question answering.
arXiv Detail & Related papers (2020-04-09T04:14:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.