Related papers: Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles

URL: http://arxiv.org/abs/2501.03991v1
Date: Tue, 07 Jan 2025 18:48:42 GMT
Title: Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles
Authors: Yuxi Xia, Pedro Henrique Luz de Araujo, Klim Zaporojets, Benjamin Roth
Abstract summary: Calib-n is a novel framework that trains an auxiliary model for confidence estimation. We find that few-shot prompts are the most effective for auxiliary model-based methods.
Score: 4.477423478591491
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure the generalization of their methods to other prompt styles and different sizes of LLMs. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate if incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal loss improve calibration from baselines. We find that few-shot prompts are the most effective for auxiliary model-based methods, and auxiliary models demonstrate robust calibration performance across accuracy variations, outperforming LLMs' internal probabilities and verbalized confidences. These insights deepen the understanding of influence factors in LLM calibration, supporting their reliable deployment in diverse applications.
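The quantities named in the abstract can be illustrated with a small sketch. This is not the Calib-n implementation: the function names, the focal-loss exponent gamma=2.0, and the 10-bin expected calibration error (ECE) are assumptions chosen for illustration. It shows binary cross-entropy and focal loss as training objectives for a confidence estimator, and ECE as one common way to measure the gap between confidence and accuracy.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Mean BCE between predicted confidences p and 0/1 correctness labels y."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss: BCE down-weighted by (1 - p_t)^gamma on easy examples.

    gamma=2.0 is an illustrative default, not a value taken from the paper.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1 - p)  # probability assigned to the true label
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)

if __name__ == "__main__":
    # Hypothetical confidences and correctness labels, for demonstration only.
    conf = [0.9, 0.8, 0.7, 0.6, 0.95]
    correct = [1, 1, 0, 1, 1]
    print("BCE  :", binary_cross_entropy(conf, correct))
    print("Focal:", focal_loss(conf, correct))
    print("ECE  :", expected_calibration_error(conf, correct))
```

With gamma=0 the focal loss reduces to binary cross-entropy; larger gamma shifts weight toward examples where the estimator is confidently wrong.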

Related papers

Zero-Shot Strategies for Length-Controllable Summarization [56.15356055672189]
Large language models (LLMs) struggle with precise length control, particularly in zero-shot settings. We conduct a comprehensive study evaluating LLMs' length control capabilities across multiple measures and propose practical methods to improve controllability. Our experiments with LLaMA 3 reveal stark differences in length adherence across measures and highlight inherent biases of the model.
arXiv Detail & Related papers (2024-12-31T02:53:27Z)
Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration [20.049443396032423]
Black-box large language models (LLMs) are increasingly deployed in various environments. LLMs often exhibit overconfidence, leading to potential risks and misjudgments. We propose a novel method, Atypical Presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates.
arXiv Detail & Related papers (2024-09-05T03:45:35Z)
CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives. Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance. In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation [18.815226646364476]
Existing calibration methods for large language models (LLMs) focus on estimating or eliciting individual confidence without taking full advantage of the "Collective Wisdom". We propose Collaborative Calibration, a post-hoc training-free calibration strategy that leverages the collaborative and expressive capabilities of multiple tool-augmented LLM agents in a simulated group deliberation process.
arXiv Detail & Related papers (2024-04-14T02:40:43Z)
Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models [84.94220787791389]
We propose Fact-and-Reflection (FaR) prompting, which improves LLM calibration in two steps. Experiments show that FaR achieves significantly better calibration; it lowers the Expected Calibration Error by 23.5%. FaR even elicits the capability of verbally expressing concerns in less confident scenarios.
arXiv Detail & Related papers (2024-02-27T01:37:23Z)
Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. Results show that consistency-based calibration methods outperform existing post-hoc approaches. We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
Thermometer: Towards Universal Calibration for Large Language Models [22.03852781949075]
We propose Thermometer, a calibration approach tailored to large language models (LLMs). Thermometer learns an auxiliary model, given data from multiple tasks, for calibrating an LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks.
arXiv Detail & Related papers (2024-02-20T04:13:48Z)
On Task Performance and Model Calibration with Supervised and Self-Ensembled In-Context Learning [71.44986275228747]
In-context learning (ICL) has become an efficient approach propelled by the recent advancements in large language models (LLMs). However, both paradigms are prone to suffer from the critical problem of overconfidence (i.e., miscalibration).
arXiv Detail & Related papers (2023-12-21T11:55:10Z)
On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes. Our analysis reveals that the impact of diversified human preferences depends on both model size and data size. Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.