Uncertainty Quantification with Pre-trained Language Models: A
Large-Scale Empirical Analysis
- URL: http://arxiv.org/abs/2210.04714v1
- Date: Mon, 10 Oct 2022 14:16:01 GMT
- Title: Uncertainty Quantification with Pre-trained Language Models: A
Large-Scale Empirical Analysis
- Authors: Yuxin Xiao, Paul Pu Liang, Umang Bhatt, Willie Neiswanger, Ruslan
Salakhutdinov, Louis-Philippe Morency
- Abstract summary: It is crucial for the pipeline to minimize the calibration error, especially in safety-critical applications.
There are various considerations behind the pipeline: (1) the choice and (2) the size of PLM, (3) the choice of uncertainty quantifier, (4) the choice of fine-tuning loss, and many more.
In response, we recommend the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs if possible, (3) use Temp Scaling as the uncertainty quantifier, and (4) use Focal Loss for fine-tuning.
- Score: 120.9545643534454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (PLMs) have gained increasing popularity due to
their compelling prediction performance in diverse natural language processing
(NLP) tasks. When formulating a PLM-based prediction pipeline for NLP tasks, it
is also crucial for the pipeline to minimize the calibration error, especially
in safety-critical applications. That is, the pipeline should reliably indicate
when we can trust its predictions. In particular, there are various
considerations behind the pipeline: (1) the choice and (2) the size of PLM, (3)
the choice of uncertainty quantifier, (4) the choice of fine-tuning loss, and
many more. Although prior work has looked into some of these considerations, it
usually draws conclusions from a limited scope of empirical studies. A holistic
analysis of how to compose a well-calibrated PLM-based prediction pipeline is
still lacking. To fill this void, we compare a wide range of
popular options for each consideration based on three prevalent NLP
classification tasks and the setting of domain shift. In response, we recommend
the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs if
possible, (3) use Temp Scaling as the uncertainty quantifier, and (4) use Focal
Loss for fine-tuning.
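To make the last two recommendations concrete, the sketch below shows one common way to implement temperature scaling and focal loss in PyTorch. This is a minimal, hedged illustration rather than the authors' code: the function names, the LBFGS-based fitting loop, and the gamma value are assumptions.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, lr=0.01, max_iter=200):
    """Fit a single temperature T on held-out validation logits by minimizing NLL
    (post-hoc temperature scaling). val_logits: [N, C] tensor, val_labels: [N] tensor."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=lr, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

def focal_loss(logits, labels, gamma=2.0):
    """Multi-class focal loss: down-weights well-classified examples by (1 - p_correct)^gamma."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, labels, reduction="none")
    p_correct = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - p_correct) ** gamma * nll).mean()

# Usage sketch: fine-tune the PLM classifier with focal_loss instead of plain
# cross-entropy, fit T on validation logits, then divide test logits by T
# before taking softmax confidences.
T = fit_temperature(torch.randn(32, 3), torch.randint(0, 3, (32,)))
```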
Related papers
- Using Large Language Models for Expert Prior Elicitation in Predictive Modelling [53.54623137152208]
This study proposes using large language models (LLMs) to elicit expert prior distributions for predictive models.
We compare LLM-elicited and uninformative priors, evaluate whether LLMs truthfully generate parameter distributions, and propose a model selection strategy for in-context learning and prior elicitation.
Our findings show that LLM-elicited prior parameter distributions significantly reduce predictive error compared to uninformative priors in low-data settings.
arXiv Detail & Related papers (2024-11-26T10:13:39Z)
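As a rough illustration of the prior-elicitation idea summarized above, the sketch below plugs a hypothetical LLM-elicited Normal prior into a conjugate Normal-Normal update and contrasts it with an uninformative prior in a low-data regime. The prompt, the elicited numbers, and the conjugate model are assumptions made for the sketch, not the paper's protocol.

```python
import numpy as np

# Hypothetical output parsed from an LLM prompted for a Normal prior on a
# model parameter, e.g. "Return the mean and std of your prior for <coefficient>".
elicited_prior = {"mean": 0.8, "std": 0.2}        # assumed LLM response
uninformative_prior = {"mean": 0.0, "std": 10.0}  # broad baseline prior

def posterior_mean(prior, observations, obs_std=1.0):
    """Conjugate Normal-Normal update of the prior mean given observed data."""
    prior_precision = 1.0 / prior["std"] ** 2
    data_precision = len(observations) / obs_std ** 2
    return (prior_precision * prior["mean"] + data_precision * np.mean(observations)) / (
        prior_precision + data_precision
    )

# With only a handful of observations, the elicited prior dominates and, if it is
# roughly right, reduces predictive error relative to the uninformative prior.
rng = np.random.default_rng(0)
observations = rng.normal(loc=0.7, scale=1.0, size=3)
print(posterior_mean(elicited_prior, observations))
print(posterior_mean(uninformative_prior, observations))
```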
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more-efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
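The two-step recipe summarized above (a power-law fit of domain-specific pre-training loss against FLOPs, followed by a small two-layer network mapping domain losses to downstream performance) can be sketched as follows. The functional form, the synthetic numbers, and the scikit-learn regressor are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.neural_network import MLPRegressor

# Step 1: fit a power law for one domain's pre-training loss as a function of
# compute (FLOPs, here rescaled to units of 1e18 for numerical stability).
def power_law(compute, a, b, c):
    return a * compute ** (-b) + c

compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])       # FLOPs / 1e18 (illustrative)
domain_loss = np.array([3.20, 3.00, 2.80, 2.65, 2.50])  # illustrative losses
params, _ = curve_fit(power_law, compute, domain_loss, p0=[1.0, 0.3, 2.0], maxfev=10000)
extrapolated_loss = power_law(1000.0, *params)          # predicted loss at 1e21 FLOPs

# Step 2: a small two-layer network maps several domain-specific losses to a
# downstream metric (synthetic data stands in for real training runs).
rng = np.random.default_rng(0)
domain_losses = rng.uniform(2.0, 3.5, size=(50, 4))     # losses on 4 data domains
downstream = 1.0 / domain_losses.mean(axis=1)           # stand-in downstream score
mlp = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0)
mlp.fit(domain_losses, downstream)
print(extrapolated_loss, mlp.predict([[2.4, 2.5, 2.6, 2.7]]))
```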
- Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE).
QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query.
This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
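A minimal sketch of the idea above: once an LLM has judged the relevance of each ranked item (pseudo-labels), any IR measure can be computed from those judgments and used as the query performance prediction. The hard-coded judgments and the two measures below are illustrative assumptions.

```python
def reciprocal_rank(judgments):
    """Reciprocal rank of the first relevant item, from 0/1 pseudo-labels."""
    for rank, relevant in enumerate(judgments, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(judgments, k):
    """Fraction of the top-k items judged relevant."""
    return sum(judgments[:k]) / k

# judge_relevance(query, item) would be an LLM call returning 0/1 per ranked item;
# here a hypothetical, hard-coded output keeps the sketch self-contained.
pseudo_labels = [0, 1, 1, 0, 1]   # assumed judgments for a 5-item ranking
print(reciprocal_rank(pseudo_labels), precision_at_k(pseudo_labels, 3))
# The measure computed from the pseudo-labels serves directly as the QPP estimate.
```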
- Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve [21.55766758950951]
We make predictions about the strategies that large language models will adopt to solve next-word prediction tasks.
We evaluate two LLMs on eleven tasks and find robust evidence that LLMs are influenced by probability.
We conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system.
arXiv Detail & Related papers (2023-09-24T13:35:28Z)
- Making Pre-trained Language Models both Task-solvers and Self-calibrators [52.98858650625623]
Pre-trained language models (PLMs) serve as backbones for various real-world systems.
Previous work shows that introducing an extra calibration task can mitigate the miscalibration of their confidence estimates.
We propose a training algorithm, LM-TOAST, to tackle these challenges.
arXiv Detail & Related papers (2023-07-21T02:51:41Z)
- Selection by Prediction with Conformal p-values [7.917044695538599]
We study screening procedures that aim to select candidates whose unobserved outcomes exceed user-specified values.
We develop a method that wraps around any prediction model to produce a subset of candidates while controlling the proportion of falsely selected units.
arXiv Detail & Related papers (2022-10-04T06:34:49Z)
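A simplified sketch of the selection-by-prediction idea above, under the assumption that larger model scores indicate larger outcomes: calibration units whose outcomes do not exceed the threshold provide the reference distribution for conformal p-values, and Benjamini-Hochberg is applied to control the proportion of false selections. The exact p-value construction in the paper may differ.

```python
import numpy as np

def conformal_selection(pred_test, pred_calib, y_calib, threshold, alpha=0.1):
    """Select test units whose unobserved outcome is claimed to exceed `threshold`,
    controlling the proportion of false selections at roughly `alpha`."""
    # Calibration units whose outcome does NOT exceed the threshold act as nulls.
    null_scores = pred_calib[y_calib <= threshold]
    m = len(null_scores)
    # Conformal p-value: how often a null calibration score is at least as large
    # as the test score (large predicted scores suggest large outcomes).
    pvals = np.array([(1 + np.sum(null_scores >= s)) / (m + 1) for s in pred_test])
    # Benjamini-Hochberg step-up procedure on the conformal p-values.
    order = np.argsort(pvals)
    n = len(pvals)
    below = np.where(pvals[order] <= alpha * np.arange(1, n + 1) / n)[0]
    if len(below) == 0:
        return np.array([], dtype=int)
    return np.sort(order[: below[-1] + 1])

# Usage: any pre-trained regressor supplies the scores; numbers are illustrative.
rng = np.random.default_rng(0)
y_calib = rng.normal(size=200)
pred_calib = y_calib + rng.normal(scale=0.5, size=200)
pred_test = rng.normal(loc=1.0, size=20)
print(conformal_selection(pred_test, pred_calib, y_calib, threshold=0.0))
```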
- Solving Multistage Stochastic Linear Programming via Regularized Linear Decision Rules: An Application to Hydrothermal Dispatch Planning [77.34726150561087]
We propose a novel regularization scheme for linear decision rules (LDR) based on the adaptive least absolute shrinkage and selection operator (AdaSO).
Experiments show that the risk of overfitting is non-negligible when the classical non-regularized LDR is used to solve MSLP.
For the LHDP problem, our analysis highlights several benefits of the proposed framework in comparison to the non-regularized benchmark.
arXiv Detail & Related papers (2021-10-07T02:36:14Z)
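The regularization ingredient above, the adaptive lasso, can be sketched in isolation: an initial ordinary-least-squares fit supplies the penalty weights, and a weighted L1 fit shrinks unimportant coefficients of a linear decision rule toward zero. The multistage stochastic-programming structure and the hydrothermal application are omitted; the data and parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, y, alpha=0.1, gamma=1.0):
    """Adaptive lasso via feature rescaling: penalty weights come from an
    initial OLS fit, so unimportant coefficients are shrunk more aggressively."""
    ols = LinearRegression().fit(X, y)
    weights = 1.0 / (np.abs(ols.coef_) ** gamma + 1e-8)  # per-coefficient L1 weights
    lasso = Lasso(alpha=alpha).fit(X / weights, y)       # plain lasso on rescaled features
    return lasso.coef_ / weights                         # map back to the original scale

# Toy linear decision rule: decision = coefficients @ observed state features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
print(np.round(adaptive_lasso(X, y), 2))   # irrelevant features shrink toward zero
```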
- Towards Improving Selective Prediction Ability of NLP Systems [24.774450633678125]
We propose a method that improves the probability estimates of models by calibrating them using prediction confidence and instance difficulty scores.
We instantiate our method with Natural Language Inference (NLI) and Duplicate Detection (DD) tasks and evaluate it in both In-Domain (IID) and Out-of-Domain (OOD) settings.
arXiv Detail & Related papers (2020-08-21T08:46:36Z)
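A hedged sketch of the calibration idea above: fit a simple calibrator on held-out data that maps a model's prediction confidence and a per-instance difficulty score to the probability of being correct. The logistic-regression calibrator, the synthetic data, and the difficulty definition are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(confidence, difficulty, correct):
    """Fit a calibrator mapping (prediction confidence, instance difficulty)
    to the probability that the underlying prediction is correct."""
    features = np.column_stack([confidence, difficulty])
    return LogisticRegression().fit(features, correct)

# Held-out data: original model confidences, difficulty scores, 0/1 correctness.
rng = np.random.default_rng(0)
confidence = rng.uniform(0.5, 1.0, size=500)
difficulty = rng.uniform(0.0, 1.0, size=500)
correct = (rng.uniform(size=500) < confidence * (1.0 - 0.3 * difficulty)).astype(int)

calibrator = fit_calibrator(confidence, difficulty, correct)
# Calibrated probability of correctness for a new prediction:
print(calibrator.predict_proba([[0.9, 0.8]])[0, 1])
```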