Calibrate Before Use: Improving Few-Shot Performance of Language Models
- URL: http://arxiv.org/abs/2102.09690v1
- Date: Fri, 19 Feb 2021 00:23:59 GMT
- Title: Calibrate Before Use: Improving Few-Shot Performance of Language Models
- Authors: Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh
- Abstract summary: GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples.
We show that this type of few-shot learning can be unstable.
The choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art.
- Score: 68.17016463756474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPT-3 can perform numerous tasks when provided a natural language prompt that
contains a few training examples. We show that this type of few-shot learning
can be unstable: the choice of prompt format, training examples, and even the
order of the training examples can cause accuracy to vary from near chance to
near state-of-the-art. We demonstrate that this instability arises from the
bias of language models towards predicting certain answers, e.g., those that
are placed near the end of the prompt or are common in the pre-training data.
To mitigate this, we first estimate the model's bias towards each answer by
asking for its prediction when given the training prompt and a content-free
test input such as "N/A". We then fit calibration parameters that cause the
prediction for this input to be uniform across answers. On a diverse set of
tasks, this contextual calibration procedure substantially improves GPT-3 and
GPT-2's average accuracy (up to 30.0% absolute) and reduces variance across
different choices of the prompt.
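To make the procedure concrete, here is a minimal NumPy sketch of the contextual calibration step described above. It assumes the label probabilities for the content-free input and for a real test input have already been obtained from the model; the function name, label ordering, and toy numbers are illustrative, and the diagonal form W = diag(p_cf)^-1 with b = 0 follows the variant reported in the paper.

```python
import numpy as np

def contextual_calibration(p_content_free, p_test):
    """Contextual calibration sketch (after Zhao et al., 2021).

    p_content_free: label probabilities the LM assigns when the few-shot
        prompt is followed by a content-free input such as "N/A".
    p_test: label probabilities the LM assigns for a real test input.
    Returns calibrated scores; the argmax is the calibrated prediction.
    """
    p_cf = np.asarray(p_content_free, dtype=float)
    p_cf = p_cf / p_cf.sum()               # renormalize over the label set
    W = np.diag(1.0 / p_cf)                # W = diag(p_cf)^-1, b = 0
    q = W @ np.asarray(p_test, dtype=float)
    return q / q.sum()                     # normalize for readability

# Toy two-label example (illustrative values): the prompt is biased toward "positive".
p_cf = [0.7, 0.3]      # P(positive), P(negative) given the "N/A" input
p_test = [0.6, 0.4]    # raw prediction for a real test sentence
print(contextual_calibration(p_cf, p_test))   # ~[0.39, 0.61]: the raw "positive" call was prompt bias
```

For the content-free input itself, W p_cf is the all-ones vector, so the calibrated prediction is uniform across answers, matching the fitting criterion stated in the abstract.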
Related papers
- Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification [20.85088711770188]
We show that it is possible to improve prompt-based learning without additional labeled data.
We propose Embroid, a method which computes multiple representations of a dataset under different embedding functions.
We find that Embroid substantially improves performance over original prompts.
arXiv Detail & Related papers (2023-07-20T17:07:28Z)
- Unsupervised Calibration through Prior Adaptation for Text Classification using Large Language Models [37.39843935632105]
We propose an approach to adapt the prior class distribution to perform text classification tasks without the need for labelled samples.
Results show that these methods outperform the un-adapted model across different numbers of training shots in the prompt.
arXiv Detail & Related papers (2023-07-13T12:11:36Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel search strategy based on greedy search to identify a near-optimal prompt that improves the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- Improving Few-Shot Performance of Language Models via Nearest Neighbor Calibration [12.334422701057674]
We propose a novel nearest-neighbor calibration framework for in-context learning.
It is inspired by the observation that in-context learning often assigns incorrect labels to the training instances themselves when they are used as test inputs.
Experiments on various few-shot text classification tasks demonstrate that our method significantly improves in-context learning.
arXiv Detail & Related papers (2022-12-05T12:49:41Z)
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
- The Unreliability of Explanations in Few-Shot In-Context Learning [50.77996380021221]
We focus on two NLP tasks that involve reasoning over text, namely question answering and natural language inference.
We show that explanations judged as good by humans, i.e., those that are logically consistent with the input, usually indicate more accurate predictions.
We present a framework for calibrating model predictions based on the reliability of the explanations.
arXiv Detail & Related papers (2022-05-06T17:57:58Z)
- Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs) to detect social biases.
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Towards Improving Selective Prediction Ability of NLP Systems [24.774450633678125]
We propose a method that improves probability estimates of models by calibrating them using prediction confidence and difficulty score of instances.
We instantiate our method with Natural Language Inference (NLI) and Duplicate Detection (DD) tasks and evaluate it in both In-Domain (IID) and Out-of-Domain (OOD) settings.
arXiv Detail & Related papers (2020-08-21T08:46:36Z)