Conformal Nucleus Sampling
- URL: http://arxiv.org/abs/2305.02633v1
- Date: Thu, 4 May 2023 08:11:57 GMT
- Title: Conformal Nucleus Sampling
- Authors: Shauli Ravfogel, Yoav Goldberg and Jacob Goldberger
- Abstract summary: We assess whether a top-$p$ set is indeed aligned with its probabilistic meaning in various linguistic contexts.
We find that OPT models are overconfident, and that calibration shows a moderate inverse scaling with model size.
- Score: 67.5232384936661
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models generate text based on successively sampling the next word. A
decoding procedure based on nucleus (top-$p$) sampling chooses from the
smallest possible set of words whose cumulative probability exceeds the
probability $p$. In this work, we assess whether a top-$p$ set is indeed
aligned with its probabilistic meaning in various linguistic contexts. We
employ conformal prediction, a calibration procedure that focuses on the
construction of minimal prediction sets according to a desired confidence
level, to calibrate the parameter $p$ as a function of the entropy of the next
word distribution. We find that OPT models are overconfident, and that
calibration shows a moderate inverse scaling with model size.
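Below is a minimal Python sketch of the top-$p$ (nucleus) set construction described in the abstract: the smallest set of words whose cumulative probability exceeds $p$. The vocabulary size, probabilities, and threshold are illustrative placeholders, not values from the paper.

```python
import numpy as np

def top_p_set(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Return indices of the smallest set of tokens whose cumulative
    probability exceeds p (the nucleus)."""
    order = np.argsort(probs)[::-1]                  # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    # Keep every token up to and including the first one that pushes
    # the cumulative mass above p.
    cutoff = int(np.searchsorted(cumulative, p, side="right")) + 1
    return order[:cutoff]

# Toy next-word distribution (hypothetical values).
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
print(top_p_set(probs, p=0.9))   # -> [0 1 2 3], cumulative mass 0.95 > 0.9
```

The paper's calibration question is whether a fixed $p$ produces sets whose empirical coverage of the true next word matches $p$ across linguistic contexts; the conformal procedure instead calibrates the threshold as a function of the entropy of the next-word distribution.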
Related papers
- Conformal Language Modeling [61.94417935386489]
We propose a novel approach to conformal prediction for generative language models (LMs).
Standard conformal prediction produces prediction sets with rigorous statistical guarantees (a generic construction of such sets is sketched after this list).
We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation.
arXiv Detail & Related papers (2023-06-16T21:55:08Z) - Truncation Sampling as Language Model Desmoothing [115.28983143361681]
Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms set some words' probabilities to zero at each step.
We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
arXiv Detail & Related papers (2022-10-27T05:52:35Z) - Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models [65.52639709094963]
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize.
We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model.
arXiv Detail & Related papers (2022-10-18T22:19:41Z) - Probabilistic Conformal Prediction Using Conditional Random Samples [73.26753677005331]
Probabilistic conformal prediction (PCP) is a predictive inference algorithm that estimates a target variable by a discontinuous predictive set.
It is efficient and compatible with either explicit or implicit conditional generative models.
arXiv Detail & Related papers (2022-06-14T03:58:03Z) - Calibration of Natural Language Understanding Models with Venn--ABERS
Predictors [0.0]
Transformers are prone to generating uncalibrated predictions or extreme probabilities.
We build several inductive Venn--ABERS predictors (IVAP) based on a selection of pre-trained transformers.
arXiv Detail & Related papers (2022-05-21T13:09:01Z) - $k$-Neighbor Based Curriculum Sampling for Sequence Prediction [22.631763991832862]
Multi-step-ahead prediction in language models is challenging due to the discrepancy between training- and test-time processes.
We propose Nearest-Neighbor Replacement Sampling, a curriculum learning-based method that gradually changes an initially deterministic teacher policy.
We report our findings on two language modelling benchmarks and find that the proposed method further improves performance when used in conjunction with scheduled sampling.
arXiv Detail & Related papers (2021-01-22T20:07:29Z) - On Misspecification in Prediction Problems and Robustness via Improper
Learning [23.64462813525688]
We show that for a broad class of loss functions and parametric families of distributions, the regret of playing a "proper" predictor has a lower bound scaling at least as $\sqrt{\gamma n}$.
We exhibit instances in which this is unimprovable even over the family of all learners that may play distributions in the convex hull of the parametric family.
arXiv Detail & Related papers (2021-01-13T17:54:08Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
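As referenced in the Conformal Language Modeling entry above, here is a rough Python sketch of standard split conformal prediction for classification, showing how prediction sets at a target confidence level are built from a held-out calibration set. This is a generic textbook construction under assumed inputs (predicted class probabilities and true labels), not the specific procedure of any paper listed here.

```python
import numpy as np

def conformal_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float) -> float:
    """Calibrate a score threshold so that prediction sets built with it
    cover the true label with probability about 1 - alpha.

    cal_probs:  (n, k) predicted class probabilities on a calibration set
    cal_labels: (n,)   true class indices
    alpha:      miscoverage level (e.g. 0.1 for ~90% target coverage)
    """
    n = cal_labels.shape[0]
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected empirical quantile of the calibration scores.
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1
    return np.sort(scores)[min(k, n - 1)]

def prediction_set(test_probs: np.ndarray, q_hat: float) -> np.ndarray:
    """All classes whose nonconformity score is at most the calibrated threshold."""
    return np.where(1.0 - test_probs <= q_hat)[0]
```

With this particular score, the resulting set is simply every class whose predicted probability is at least 1 - q_hat, which is what makes the connection between conformal prediction sets and probability-threshold decoding rules such as top-$p$ natural.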