Thermometer: Towards Universal Calibration for Large Language Models
- URL: http://arxiv.org/abs/2403.08819v2
- Date: Thu, 27 Jun 2024 16:30:32 GMT
- Title: Thermometer: Towards Universal Calibration for Large Language Models
- Authors: Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, Soumya Ghosh
- Abstract summary: We propose THERMOMETER, a calibration approach tailored to large language models (LLMs).
THERMOMETER learns an auxiliary model, given data from multiple tasks, for calibrating an LLM.
It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks.
- Score: 22.03852781949075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the issue of calibration in large language models (LLMs). Recent studies have found that common interventions such as instruction tuning often result in poorly calibrated LLMs. Although calibration is well-explored in traditional applications, calibrating LLMs is uniquely challenging. These challenges stem as much from the severe computational requirements of LLMs as from their versatility, which allows them to be applied to diverse tasks. Addressing these challenges, we propose THERMOMETER, a calibration approach tailored to LLMs. THERMOMETER learns an auxiliary model, given data from multiple tasks, for calibrating an LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks. Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method.
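The core mechanism lends itself to a short sketch: temperature scaling where the temperature is predicted by a small auxiliary network from the frozen LLM's features. Below is a minimal PyTorch sketch; the architecture, feature choice, and training details are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperaturePredictor(nn.Module):
    """Auxiliary model: maps frozen-LLM features to a positive temperature."""
    def __init__(self, feature_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return F.softplus(self.net(features)) + 1e-3  # keep temperature > 0

def calibration_loss(logits, features, labels, temp_model):
    """NLL of temperature-scaled logits; the LLM itself stays frozen."""
    tau = temp_model(features)            # shape (batch, 1)
    return F.cross_entropy(logits / tau, labels)

# Only the small auxiliary model is optimized, over data pooled from many tasks.
temp_model = TemperaturePredictor(feature_dim=4096)
optimizer = torch.optim.Adam(temp_model.parameters(), lr=1e-4)
```

Dividing logits by a positive scalar never changes the argmax, which is why this style of calibration preserves the LLM's accuracy; and since only the small auxiliary network is trained, the approach stays computationally cheap and can be applied to unseen tasks.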
Related papers
- Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles [4.477423478591491]
Calib-n is a novel framework that trains an auxiliary model for confidence estimation.
We find that few-shot prompts are the most effective for auxiliary model-based methods.
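A minimal sketch of the auxiliary-model idea, using agreement among several models' answers as the input signal; the feature set and classifier here are assumptions for illustration, not Calib-n's actual design.

```python
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

def agreement_features(responses: list[str]) -> np.ndarray:
    """Majority-agreement rate and number of distinct answers."""
    counts = Counter(responses)
    majority = counts.most_common(1)[0][1]
    return np.array([majority / len(responses), len(counts)])

# Toy placeholder data: per question, answers from several models and
# whether the majority answer was correct.
all_responses = [["A", "A", "B"], ["C", "C", "C"], ["A", "B", "C"]]
y_correct = np.array([1, 1, 0])

X = np.stack([agreement_features(r) for r in all_responses])
confidence = LogisticRegression().fit(X, y_correct).predict_proba(X)[:, 1]
```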
arXiv Detail & Related papers (2025-01-07T18:48:42Z)
- Zero-Shot Strategies for Length-Controllable Summarization [56.15356055672189]
Large language models (LLMs) struggle with precise length control, particularly in zero-shot settings.
We conduct a comprehensive study evaluating LLMs' length control capabilities across multiple measures and propose practical methods to improve controllability.
Our experiments with LLaMA 3 reveal stark differences in length adherence across measures and highlight inherent biases of the model.
arXiv Detail & Related papers (2024-12-31T02:53:27Z)
- Guiding Reinforcement Learning Using Uncertainty-Aware Large Language Models [1.2233495442213964]
Large Language Models (LLMs) offer a promising alternative to mitigate RL sample inefficiency and potentially replace human trainers.
We address this limitation with a calibrated guidance system that uses Monte Carlo Dropout to enhance the reliability of LLM advice.
We also develop a novel RL policy shaping method based on dynamic model average entropy to adjust the LLM's influence on RL policies according to guidance uncertainty.
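A minimal sketch of the Monte Carlo Dropout step, assuming a dropout-equipped advice model; the simple entropy threshold at the end is a simplification of the paper's entropy-based policy shaping.

```python
import torch

def mc_dropout_probs(model, x, n_samples: int = 20) -> torch.Tensor:
    """Average predictions over stochastic forward passes with dropout on."""
    model.train()  # keeps dropout layers active at inference time
    with torch.no_grad():
        samples = [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    return torch.stack(samples).mean(dim=0)

def predictive_entropy(probs: torch.Tensor) -> torch.Tensor:
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

# Gate the advice: let the LLM shape the RL policy only when it is certain.
# probs = mc_dropout_probs(advice_model, state_batch)
# trust_advice = predictive_entropy(probs) < threshold
```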
arXiv Detail & Related papers (2024-11-15T22:00:29Z)
- CorrectionLM: Self-Corrections with SLM for Dialogue State Tracking [16.057622631156164]
Large language models (LLMs) have demonstrated self-improvement capabilities via feedback and refinement, but current small language models (SLMs) have had limited success in this area.
We introduce CORRECTIONLM, a novel correction framework that enables SLMs to self-correct using in-context exemplars without LLM involvement.
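A minimal sketch of the in-context self-correction pattern; `slm` is a placeholder for a small-model completion call, and the template is an assumption, not CorrectionLM's exact format.

```python
def self_correct(draft: str, exemplars: list[tuple[str, str]], slm) -> str:
    """Few-shot correction: show (draft, corrected) pairs, then correct."""
    shots = "\n\n".join(f"Draft: {d}\nCorrected: {c}" for d, c in exemplars)
    return slm(f"{shots}\n\nDraft: {draft}\nCorrected:")
```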
arXiv Detail & Related papers (2024-10-23T18:27:16Z)
- Atomic Calibration of LLMs in Long-Form Generations [46.01229352035088]
Large language models (LLMs) often suffer from hallucinations, posing significant challenges for real-world applications.
We introduce atomic calibration, a novel approach that evaluates factuality calibration at a fine-grained level by breaking down long responses into atomic claims.
Our experiments show that atomic calibration is well-suited for long-form generation and can also improve macro calibration results.
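A minimal sketch of measuring calibration at the claim level: split a long response into atomic claims, then compare per-claim confidence against factuality via expected calibration error. The sentence-level splitter and toy scores are placeholders; a real pipeline would use a dedicated claim decomposer and a factuality verifier.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins: int = 10) -> float:
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

# Crude stand-in: one "atomic claim" per sentence.
long_response = "Paris is in France. The Seine flows through it. It has 12 moons."
claims = [s.strip() for s in long_response.split(".") if s.strip()]
conf_per_claim = [0.95, 0.90, 0.85]   # toy confidences, one per claim
correct_per_claim = [1, 1, 0]         # toy factuality labels
print(expected_calibration_error(conf_per_claim, correct_per_claim))
```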
arXiv Detail & Related papers (2024-10-17T06:09:26Z)
- Does Alignment Tuning Really Break LLMs' Internal Confidence? [5.893124686141782]
Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration.
This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods.
arXiv Detail & Related papers (2024-08-31T05:12:36Z)
- CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z)
- Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) for diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
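A minimal sketch of the generic low-rank delta compression described above: store only a rank-r factorization of the difference between fine-tuned and base weights. This is the baseline the paper critiques, not Delta-CoMe's mixed-precision scheme, which instead varies quantization bit-width across the singular vectors rather than truncating them outright.

```python
import torch

def compress_delta(w_base: torch.Tensor, w_ft: torch.Tensor, rank: int):
    """Keep only a rank-r factorization of the fine-tuning delta."""
    U, S, Vh = torch.linalg.svd(w_ft - w_base, full_matrices=False)
    return U[:, :rank] * S[:rank], Vh[:rank]  # A: (d, r), B: (r, d)

def reconstruct(w_base: torch.Tensor, A: torch.Tensor, B: torch.Tensor):
    return w_base + A @ B  # approximate fine-tuned weight

w_base = torch.randn(512, 512)
w_ft = w_base + 0.01 * torch.randn(512, 512)
A, B = compress_delta(w_base, w_ft, rank=16)
rel_error = (reconstruct(w_base, A, B) - w_ft).norm() / w_ft.norm()
```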
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
- Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models [84.94220787791389]
We propose Fact-and-Reflection (FaR) prompting, which improves the LLM calibration in two steps.
Experiments show that FaR achieves significantly better calibration; it lowers the Expected Calibration Error by 23.5%.
FaR even elicits the capability of verbally expressing concerns in less confident scenarios.
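A minimal sketch of the two-step prompt pattern; `llm` is a placeholder completion function, and the prompt wording is an assumption, not the paper's templates.

```python
def fact_and_reflection(question: str, llm) -> str:
    # Step 1: elicit the facts the model knows about the question.
    facts = llm(f"List known facts relevant to answering: {question}")
    # Step 2: reflect on those facts before answering with a confidence.
    return llm(
        f"Question: {question}\n"
        f"Relevant facts:\n{facts}\n"
        "Reflect on these facts, then give your answer and your confidence."
    )
```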
arXiv Detail & Related papers (2024-02-27T01:37:23Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
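A minimal sketch of one such consistency measure, the agreement rate of sampled answers; `sample` is a placeholder for a stochastic decoding call, and the paper studies further measures beyond simple agreement.

```python
from collections import Counter

def consistency_confidence(question: str, sample, k: int = 20):
    """Confidence = how often the modal answer recurs across k samples."""
    answers = [sample(question) for _ in range(k)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / k
```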
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- On Task Performance and Model Calibration with Supervised and Self-Ensembled In-Context Learning [71.44986275228747]
In-context learning (ICL) has become an efficient approach, propelled by recent advancements in large language models (LLMs).
However, both paradigms are prone to the critical problem of overconfidence (i.e., miscalibration).
arXiv Detail & Related papers (2023-12-21T11:55:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.