LitCab: Lightweight Language Model Calibration over Short- and Long-form
Responses
- URL: http://arxiv.org/abs/2310.19208v2
- Date: Wed, 13 Mar 2024 05:11:57 GMT
- Title: LitCab: Lightweight Language Model Calibration over Short- and Long-form
Responses
- Authors: Xin Liu, Muhammad Khalifa, Lu Wang
- Abstract summary: We present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term.
For evaluation, we construct CaT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs.
- We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%.
- Score: 14.77013588561901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A model is considered well-calibrated when its probability estimate aligns
with the actual likelihood of the output being correct. Calibrating language
models (LMs) is crucial, as it plays a vital role in detecting and mitigating
their hallucinations and in building more trustworthy models. However,
standard calibration techniques may not be suited for LM calibration. For
instance, post-processing methods such as temperature scaling do not reorder
the candidate generations. On the other hand, training-based methods require
fine-tuning the entire model, which is impractical for LMs of large scale. We
present LitCab, a lightweight calibration mechanism consisting of a single
linear layer that takes the input text representation and predicts a bias term,
which is then added to the LM output logits. LitCab improves model calibration
by only adding < 2% of the original model parameters. For evaluation, we
construct CaT, a benchmark consisting of eight text generation tasks, covering
responses ranging from short phrases to paragraphs. We test LitCab with
Llama2-7B, where it improves calibration across all tasks, reducing the average
ECE score by as much as 30%. We further conduct a comprehensive evaluation
with multiple popular open-sourced LMs from GPT and LLaMA families, yielding
the following key findings: (i) Larger models within the same family exhibit
better calibration on short-generation tasks, but not necessarily on longer
ones. (ii) GPT-family models show superior calibration compared to
LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii)
Fine-tuning a pretrained model (e.g., LLaMA) on narrowly scoped data
(e.g., conversations) may lead to worse calibration, highlighting the
importance of the fine-tuning setup for calibrating LMs.
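To make the mechanism concrete, below is a minimal PyTorch sketch of the LitCab idea as described in the abstract: a single linear layer maps the LM's final hidden states to a per-vocabulary bias that is added to the frozen model's output logits. The class name, interface, and layer sizes (Llama2-7B's hidden size 4096 and vocabulary size 32000) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LitCabHead(nn.Module):
    """Sketch: a single linear layer predicts a bias over the vocabulary
    from the input text representation; the bias is added to the LM's
    output logits. With hidden_size=4096 and vocab_size=32000 this is
    ~131M parameters, i.e. under 2% of a 7B-parameter model."""

    def __init__(self, hidden_size: int = 4096, vocab_size: int = 32000):
        super().__init__()
        self.bias_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor,
                logits: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden_size) from the frozen LM
        # logits:        (batch, seq, vocab_size)  from the frozen LM
        return logits + self.bias_head(hidden_states)
```

Under this reading, only the added layer would be trained while the base LM stays frozen, so the calibration overhead is a single matrix multiply per token.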
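The abstract reports ECE reductions; for reference, below is the standard expected calibration error computation (equal-width confidence bins, with the accuracy-vs-confidence gap weighted by bin mass). This is the common textbook definition rather than code from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: partition predictions into confidence bins and average the
    |accuracy - mean confidence| gap, weighted by the fraction of
    predictions falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```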
Related papers
- Calibrating Language Models with Adaptive Temperature Scaling [58.056023173579625]
We introduce Adaptive Temperature Scaling (ATS), a post-hoc calibration method that predicts a temperature scaling parameter for each token prediction.
ATS improves calibration by 10-50% across three downstream natural language evaluation benchmarks compared with prior calibration methods.
arXiv Detail & Related papers (2024-09-29T22:54:31Z)
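A minimal sketch of the ATS idea summarized in the entry above, assuming a small learned head that predicts one positive temperature per token and rescales that token's logits post hoc; class and attribute names and sizes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PerTokenTemperature(nn.Module):
    """Sketch of adaptive temperature scaling: a small head predicts one
    positive temperature per token and rescales that token's logits."""

    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.temp_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor,
                logits: torch.Tensor) -> torch.Tensor:
        # softplus keeps each predicted temperature strictly positive
        temperature = nn.functional.softplus(self.temp_head(hidden_states)) + 1e-6
        # (batch, seq, vocab) / (batch, seq, 1): one temperature per token
        return logits / temperature
```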
- Does Alignment Tuning Really Break LLMs' Internal Confidence? [5.893124686141782]
Large Language Models (LLMs) have shown remarkable progress, but their real-world application requires reliable calibration.
This study conducts a comprehensive analysis of calibration degradation in LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods.
arXiv Detail & Related papers (2024-08-31T05:12:36Z)
- Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs [27.38239289662178]
Post-Training Quantization (PTQ) enhances the efficiency of Large Language Models (LLMs).
We explore the role of calibration sets in PTQ, specifically their effect on hidden activations.
Our analysis reveals a marked contrast in quantization effectiveness across accessible models.
arXiv Detail & Related papers (2024-05-31T14:24:33Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
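A toy sketch of the consistency-based confidence described in the entry above, assuming the agreement rate of the majority answer among sampled generations as the consistency measure (the paper studies three such measures; this exact-match variant is only one plausible instance).

```python
from collections import Counter

def consistency_confidence(samples: list[str]) -> tuple[str, float]:
    """Derive a confidence score from agreement among multiple sampled
    generations: the majority answer's frequency is the confidence."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(samples)

# Example: consistency_confidence(["Paris", "Paris", "paris", "Lyon"])
# returns ("paris", 0.75).
```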
- On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
- A Close Look into the Calibration of Pre-trained Language Models [56.998539510508515]
Pre-trained language models (PLMs) may fail to give reliable estimates of their predictive uncertainty.
We study how PLMs' calibration performance changes dynamically during training.
We also extend two recently proposed learnable methods that directly collect data to train models to produce reasonable confidence estimates.
arXiv Detail & Related papers (2022-10-31T21:31:07Z)
- On the Calibration of Massively Multilingual Language Models [15.373725507698591]
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.
We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages.
We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially.
arXiv Detail & Related papers (2022-10-21T21:41:56Z)
- Localized Calibration: Metrics and Recalibration [133.07044916594361]
We propose a fine-grained calibration metric, the local calibration error (LCE), that spans the gap between fully global and fully individualized calibration.
We then introduce a localized recalibration method, LoRe, that reduces the LCE more effectively than existing recalibration methods.
arXiv Detail & Related papers (2021-02-22T07:22:12Z)
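As a rough illustration of the localized idea in the entry above, the sketch below computes a neighborhood-based calibration error: each example's local mean confidence is compared with local empirical accuracy among its k nearest neighbors in some feature space, interpolating between fully global (k = n) and fully individualized (k = 1) calibration. The paper's exact definitions of the LCE and of LoRe differ; this is an assumed simplification.

```python
import numpy as np

def local_calibration_error(features, confidences, correct, k: int = 50) -> float:
    """Neighborhood-based calibration error: average, over examples, the
    absolute gap between mean confidence and empirical accuracy within
    each example's k-nearest-neighbor set in feature space."""
    features = np.asarray(features, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # pairwise squared Euclidean distances, shape (n, n)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    gaps = []
    for i in range(len(features)):
        nbrs = np.argsort(d2[i])[:k]  # k nearest neighbors (incl. self)
        gaps.append(abs(confidences[nbrs].mean() - correct[nbrs].mean()))
    return float(np.mean(gaps))
```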
- Uncertainty Quantification and Deep Ensembles [79.4957965474334]
We show that deep ensembles do not necessarily lead to improved calibration properties.
We show that standard ensembling methods, when used in conjunction with modern techniques such as mixup regularization, can lead to less calibrated models.
The paper examines the interplay between three of the simplest and most commonly used approaches to leveraging deep learning when data is scarce.
arXiv Detail & Related papers (2020-07-17T07:32:24Z)
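For context, a deep ensemble's predictive distribution is typically the average of member softmax outputs, as in this minimal sketch; the paper's observation is that this averaging alone does not guarantee better calibration.

```python
import torch

def ensemble_predict(models, x: torch.Tensor) -> torch.Tensor:
    """Average the softmax outputs of independently trained ensemble
    members; averaging alone does not guarantee better calibration."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0)
```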