Calibrating Long-form Generations from Large Language Models
- URL: http://arxiv.org/abs/2402.06544v1
- Date: Fri, 9 Feb 2024 17:00:32 GMT
- Title: Calibrating Long-form Generations from Large Language Models
- Authors: Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra
- Abstract summary: Large Language Models' (LLMs) confidence scores should align with the actual likelihood of their responses being correct.
Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness.
We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
- Score: 37.2496541665881
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To enhance Large Language Models' (LLMs) reliability, calibration is
essential -- the model's assessed confidence scores should align with the
actual likelihood of its responses being correct. However, current confidence
elicitation methods and calibration metrics typically rely on a binary
true/false assessment of response correctness. This approach does not apply to
long-form generation, where an answer can be partially correct. Addressing this
gap, we introduce a unified calibration framework, in which both the
correctness of the LLMs' responses and their associated confidence levels are
treated as distributions across a range of scores. Within this framework, we
develop three metrics to precisely evaluate LLM calibration and further propose
two confidence elicitation methods based on self-consistency and
self-evaluation. Our experiments, which include long-form QA and summarization
tasks, demonstrate that larger models do not necessarily guarantee better
calibration, that calibration performance is metric-dependent, and that
self-consistency methods excel on factoid datasets. We also find that
calibration can be enhanced through techniques such as fine-tuning, integrating
relevant source documents, scaling the temperature, and combining
self-consistency with self-evaluation. Lastly, we showcase a practical
application of our system: selecting and cascading open-source models and
ChatGPT to optimize correctness given a limited API budget. This research not
only challenges existing notions of LLM calibration but also offers practical
methodologies for improving trustworthiness in long-form generation.
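
To make the two elicitation methods concrete, here is a minimal Python sketch. The helpers `generate` (an LLM sampling call) and `similarity` (a response-agreement scorer such as ROUGE or NLI entailment) are hypothetical stand-ins, and the prompts, sample counts, and 0-100 rating scale are illustrative assumptions rather than the paper's exact setup.

```python
from typing import Callable, List

def self_consistency_confidence(
    prompt: str,
    generate: Callable[[str, float], str],    # hypothetical LLM sampling call
    similarity: Callable[[str, str], float],  # hypothetical agreement scorer in [0, 1]
    n_samples: int = 5,
    temperature: float = 1.0,
) -> float:
    """Confidence = mean pairwise agreement among sampled long-form answers."""
    samples: List[str] = [generate(prompt, temperature) for _ in range(n_samples)]
    scores = [
        similarity(samples[i], samples[j])
        for i in range(n_samples)
        for j in range(i + 1, n_samples)
    ]
    return sum(scores) / len(scores)

def self_evaluation_confidence(
    prompt: str,
    answer: str,
    generate: Callable[[str, float], str],
) -> float:
    """Confidence = a score the model assigns to its own answer (illustrative
    0-100 rating prompt; assumes the model replies with a bare number)."""
    eval_prompt = (
        f"Question: {prompt}\nProposed answer: {answer}\n"
        "On a scale of 0 to 100, how correct is this answer? Reply with a number."
    )
    reply = generate(eval_prompt, 0.0)  # greedy decoding for the self-judgment
    return float(reply.strip()) / 100.0
```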
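One plausible instantiation of a calibration metric under this framework is a binned expected calibration error in which per-response correctness is a continuous score in [0, 1] (e.g., graded factuality or ROUGE) rather than a 0/1 label. This is a sketch for illustration, not necessarily one of the paper's three metrics.

```python
import numpy as np

def continuous_ece(confidences, correctness, n_bins: int = 10) -> float:
    """Expected calibration error with continuous correctness scores in [0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i == n_bins - 1:
            # make the last bin right-inclusive so confidence 1.0 is counted
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if not mask.any():
            continue
        # weight each bin by its share of samples; gap is |mean conf - mean correctness|
        gap = abs(confidences[mask].mean() - correctness[mask].mean())
        ece += mask.mean() * gap
    return ece
```

For example, `continuous_ece([0.9, 0.4], [0.8, 0.5])` yields 0.1: each response contributes half the weight and a gap of 0.1 in its bin.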
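The model-cascading application can likewise be sketched: answer with a cheap open-source model first and escalate to a paid API only when elicited confidence is low and budget remains. The fixed threshold and budget bookkeeping below are illustrative assumptions, not the paper's exact selection strategy.

```python
from typing import Callable, Optional

def cascade_answer(
    prompt: str,
    cheap_model: Callable[[str], str],       # e.g., an open-source LLM call
    expensive_model: Callable[[str], str],   # e.g., a paid API such as ChatGPT
    confidence_fn: Callable[[str, str], float],  # elicited confidence in [0, 1]
    threshold: float = 0.7,                  # illustrative routing threshold
    budget: Optional[dict] = None,           # e.g., {"calls": 100} remaining paid calls
) -> str:
    """Route to the expensive model only on low-confidence cheap answers."""
    answer = cheap_model(prompt)
    if confidence_fn(prompt, answer) < threshold and budget and budget.get("calls", 0) > 0:
        budget["calls"] -= 1  # spend one unit of the paid-API budget
        answer = expensive_model(prompt)
    return answer
```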
Related papers
- Multicalibration for Confidence Scoring in LLMs [6.948522445499497]
This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs).
We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation".
We show how our techniques can yield confidence scores that provide substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.
arXiv Detail & Related papers (2024-04-06T17:33:37Z)
- Self-Consistency Boosts Calibration for Math Reasoning [69.82896431282927]
We design three off-the-shelf calibration methods based on self-consistency for math reasoning tasks.
Our methods bridge model confidence and accuracy better than existing methods based on p(True) or logits.
arXiv Detail & Related papers (2024-03-14T20:17:10Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
- Modular Conformal Calibration [80.33410096908872]
We introduce a versatile class of algorithms for recalibration in regression.
This framework allows one to transform any regression model into a calibrated probabilistic model.
We conduct an empirical study of MCC on 17 regression datasets.
arXiv Detail & Related papers (2022-06-23T03:25:23Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric, the localized calibration error (LCE), that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that improves LCE more than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
- Calibrating Structured Output Predictors for Natural Language Processing [8.361023354729731]
We propose a general calibration scheme for output entities of interest in neural-network-based structured prediction models.
Our proposed method can be used with any binary class calibration scheme and a neural network model.
We show that our method outperforms current calibration techniques for named entity recognition, part-of-speech tagging, and question answering.
arXiv Detail & Related papers (2020-04-09T04:14:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.