Atomic Calibration of LLMs in Long-Form Generations
- URL: http://arxiv.org/abs/2410.13246v1
- Date: Thu, 17 Oct 2024 06:09:26 GMT
- Title: Atomic Calibration of LLMs in Long-Form Generations
- Authors: Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, Nigel Collier
- Abstract summary: Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications.
We introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims.
Our experiments show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results.
- Score: 46.01229352035088
- License:
- Abstract: Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications. Confidence calibration, which estimates the underlying uncertainty of model predictions, is essential to enhance the LLMs' trustworthiness. Existing research on LLM calibration has primarily focused on short-form tasks, providing a single confidence score at the response level (macro calibration). However, this approach is insufficient for long-form generations, where responses often contain more complex statements and may include both accurate and inaccurate information. Therefore, we introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims. We classify confidence elicitation methods into discriminative and generative types and demonstrate that their combination can enhance calibration. Our extensive experiments on various LLMs and datasets show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results. Additionally, atomic calibration reveals insightful patterns in LLM confidence throughout the generation process.
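To make the idea concrete, below is a minimal sketch of how per-claim (atomic) confidence might be combined and aggregated back to a response-level score. The helper names, the convex-combination rule, and the simple averaging are illustrative assumptions; the paper specifies only that long responses are decomposed into atomic claims and that discriminative and generative confidence can be combined.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AtomicClaim:
    text: str
    discriminative_conf: float  # e.g., P(True) when the model verifies the claim
    generative_conf: float      # e.g., agreement rate across resampled generations

def atomic_confidence(claim: AtomicClaim, weight: float = 0.5) -> float:
    """Combine the two elicitation types into one per-claim score.

    A simple convex combination; the paper states only that combining
    discriminative and generative confidence helps, not this exact rule.
    """
    return weight * claim.discriminative_conf + (1 - weight) * claim.generative_conf

def macro_confidence(claims: List[AtomicClaim]) -> float:
    """Aggregate atomic scores back into a response-level (macro) score."""
    if not claims:
        return 0.0
    return sum(atomic_confidence(c) for c in claims) / len(claims)
```

Calibration can then be measured per claim (e.g., a calibration error computed over atomic scores against fact-checked labels) rather than once per response.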
Related papers
- The Reliability Paradox: Exploring How Shortcut Learning Undermines Language Model Calibration [5.616884466478886]
Pre-trained language models (PLMs) have enabled significant performance gains in the field of natural language processing.
Recent studies have found PLMs to suffer from miscalibration, indicating a lack of accuracy in the confidence estimates provided by these models.
This paper investigates whether lower calibration error implies reliable decision rules for a language model.
arXiv Detail & Related papers (2024-12-17T08:04:28Z)
- Fact-Level Confidence Calibration and Self-Correction [64.40105513819272]
We propose a Fact-Level framework that calibrates confidence to relevance-weighted correctness at the fact level.
We also develop Confidence-Guided Fact-level Self-Correction (ConFix), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones.
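As a rough illustration of confidence-guided self-correction, the sketch below builds a revision prompt from high- and low-confidence facts. The prompt wording, the 0.8 threshold, and the function name are hypothetical, not taken from the paper.

```python
from typing import List, Tuple

def build_correction_prompt(facts: List[Tuple[str, float]],
                            threshold: float = 0.8) -> str:
    """Split (fact, confidence) pairs at a threshold and ask the model to
    revise the doubtful facts in light of the trusted ones."""
    trusted = [f for f, c in facts if c >= threshold]
    doubtful = [f for f, c in facts if c < threshold]
    return (
        "Known reliable facts:\n"
        + "\n".join(f"- {f}" for f in trusted)
        + "\n\nRevise the following statements so they are consistent "
        "with the reliable facts, correcting any errors:\n"
        + "\n".join(f"- {f}" for f in doubtful)
    )
```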
arXiv Detail & Related papers (2024-11-20T14:15:18Z)
- Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration [20.049443396032423]
Black-box large language models (LLMs) are increasingly deployed in various environments.
LLMs often exhibit overconfidence, leading to potential risks and misjudgments.
We propose a novel method, Atypical Presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates.
arXiv Detail & Related papers (2024-09-05T03:45:35Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
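A minimal sketch of one consistency-based confidence measure follows, assuming exact-match agreement between sampled generations; the paper evaluates three consistency measures, and this agreement rate is only a stand-in.

```python
from collections import Counter
from typing import Callable, List

def consistency_confidence(samples: List[str],
                           normalize: Callable[[str], str] = str.strip) -> float:
    """Confidence = fraction of sampled answers matching the modal answer.

    `samples` are generations drawn from the model at nonzero temperature.
    Exact match after `normalize` is a stand-in for richer consistency
    measures (e.g., semantic equivalence).
    """
    if not samples:
        return 0.0
    counts = Counter(normalize(s) for s in samples)
    _, top = counts.most_common(1)[0]
    return top / len(samples)
```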
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- Thermometer: Towards Universal Calibration for Large Language Models [22.03852781949075]
We propose Thermometer, a calibration approach tailored to large language models (LLMs).
Thermometer learns an auxiliary model, given data from multiple tasks, for calibrating an LLM.
It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks.
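The underlying mechanism is related to temperature scaling. The sketch below shows plain temperature scaling of logits; Thermometer's auxiliary model, which would predict the temperature from task data, is replaced here by a given scalar, so this is a simplification rather than the paper's method.

```python
import numpy as np

def temperature_scale(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over temperature-rescaled logits; temperature > 1 softens
    (lowers) confidence, temperature < 1 sharpens it."""
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)
```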
arXiv Detail & Related papers (2024-02-20T04:13:48Z)
- Calibrating Long-form Generations from Large Language Models [34.72041258464477]
Large Language Models' (LLMs) confidence scores should align with the actual likelihood of their responses being correct.
Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness.
We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
arXiv Detail & Related papers (2024-02-09T17:00:32Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
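One common way to implement self-evaluation is a "P(True)"-style score: prompt the model to grade its own answer and read off the probability mass on the True option. The prompt wording and the `next_token_logprobs` interface below are assumptions for illustration, not the paper's exact setup.

```python
import math
from typing import Callable, Dict

# `next_token_logprobs` is a hypothetical interface: prompt -> log-probs over
# candidate next tokens. Real APIs expose log-probs in varying formats.
def p_true_score(question: str, answer: str,
                 next_token_logprobs: Callable[[str], Dict[str, float]]) -> float:
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer correct? Answer True or False: "
    )
    lp = next_token_logprobs(prompt)
    p_t = math.exp(lp.get("True", float("-inf")))
    p_f = math.exp(lp.get("False", float("-inf")))
    total = p_t + p_f
    # Renormalize over the two options; fall back to 0.5 if neither appears.
    return p_t / total if total > 0 else 0.5
```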
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that reduces the localized calibration error (LCE) more effectively than existing recalibration methods.
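As a toy version of a localized calibration error, the estimator below kernel-weights each example's neighborhood (here, by closeness in confidence) and averages the absolute gap between local confidence and local accuracy. The Gaussian kernel and bandwidth are illustrative choices, not the paper's exact definition of LCE.

```python
import numpy as np

def localized_calibration_error(conf: np.ndarray, correct: np.ndarray,
                                bandwidth: float = 0.05) -> float:
    """conf: model confidences in [0, 1]; correct: 0/1 correctness labels.

    Instead of global bins, compute |confidence - accuracy| in a
    kernel-weighted neighborhood around every example, then average.
    """
    diffs = (conf[None, :] - conf[:, None]) / bandwidth
    w = np.exp(-0.5 * diffs ** 2)                  # (n, n) neighborhood weights
    local_acc = (w * correct[None, :]).sum(1) / w.sum(1)
    local_conf = (w * conf[None, :]).sum(1) / w.sum(1)
    return float(np.mean(np.abs(local_conf - local_acc)))
```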
arXiv Detail & Related papers (2021-02-22T07:22:12Z)