Truncation Sampling as Language Model Desmoothing
- URL: http://arxiv.org/abs/2210.15191v1
- Date: Thu, 27 Oct 2022 05:52:35 GMT
- Title: Truncation Sampling as Language Model Desmoothing
- Authors: John Hewitt, Christopher D. Manning, Percy Liang
- Abstract summary: Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms set some words' probabilities to zero at each step.
We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
- Score: 115.28983143361681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms, like top-$p$ or top-$k$, address this by
setting some words' probabilities to zero at each step. This work provides
framing for the aim of truncation, and an improved algorithm for that aim. We
propose thinking of a neural language model as a mixture of a true distribution
and a smoothing distribution that avoids infinite perplexity. In this light,
truncation algorithms aim to perform desmoothing, estimating a subset of the
support of the true distribution. Finding a good subset is crucial: we show
that top-$p$ unnecessarily truncates high-probability words, for example
causing it to truncate all words but Trump for a document that starts with
Donald. We introduce $\eta$-sampling, which truncates words below an
entropy-dependent probability threshold. Compared to previous algorithms,
$\eta$-sampling generates more plausible long English documents according to
humans, is better at breaking out of repetition, and behaves more reasonably on
a battery of test distributions.
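As a concrete illustration of the contrast described in the abstract, below is a minimal sketch (not the paper's code) of top-$p$ truncation versus an entropy-dependent threshold in the spirit of $\eta$-sampling. The threshold form $\eta = \min(\epsilon, \sqrt{\epsilon}\,e^{-H})$ follows one reading of the paper; the constants and the toy distribution are illustrative assumptions.

```python
# Sketch only: contrasts top-p truncation with an entropy-dependent threshold
# in the spirit of eta-sampling. The threshold form min(eps, sqrt(eps)*exp(-H))
# and the toy distribution are illustrative assumptions, not the paper's code.
import numpy as np

def top_p_mask(probs: np.ndarray, p: float = 0.95) -> np.ndarray:
    """Keep the smallest set of highest-probability tokens with mass >= p."""
    order = np.argsort(probs)[::-1]              # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # tokens needed to reach mass p
    mask = np.zeros_like(probs, dtype=bool)
    mask[order[:cutoff]] = True
    return mask

def eta_mask(probs: np.ndarray, eps: float = 2e-3) -> np.ndarray:
    """Keep tokens above an entropy-dependent probability threshold."""
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    eta = min(eps, np.sqrt(eps) * np.exp(-entropy))
    return probs > eta

def sample_truncated(probs: np.ndarray, mask: np.ndarray, rng=None) -> int:
    """Zero out truncated tokens, renormalize, and sample one token index."""
    rng = rng or np.random.default_rng()
    kept = np.where(mask, probs, 0.0)
    return int(rng.choice(len(probs), p=kept / kept.sum()))

# Low-entropy toy context (think "Donald" -> "Trump"): one dominant token plus
# a few plausible alternatives. Top-p keeps only the dominant token, while the
# entropy-dependent threshold also keeps the plausible alternatives.
probs = np.concatenate(([0.96], np.full(3, 0.01), np.full(96, 0.01 / 96)))
print(top_p_mask(probs).sum(), eta_mask(probs).sum())  # e.g. 1 vs 4 tokens kept
next_token = sample_truncated(probs, eta_mask(probs))
```

The toy numbers mirror the abstract's "Donald ... Trump" example: under a confident distribution, top-$p$ discards every alternative continuation, while the entropy-scaled threshold retains the handful of plausible ones.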
Related papers
- SpecTr: Fast Speculative Decoding via Optimal Transport [30.18181671899423]
We develop a new autoregressive sampling algorithm called SpecTr, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output.
We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
arXiv Detail & Related papers (2023-10-23T17:47:34Z) - Closing the Curious Case of Neural Text Degeneration [91.22954750742183]
We provide a theoretical explanation for the effectiveness of truncation sampling.
We show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability.
Our evaluations show that our method outperforms its threshold-based counterparts for low-entropy text generation.
arXiv Detail & Related papers (2023-10-02T23:16:25Z) - Conformal Nucleus Sampling [67.5232384936661]
We assess whether a top-$p$ set is indeed aligned with its probabilistic meaning in various linguistic contexts.
We find that OPT models are overconfident, and that calibration shows a moderate inverse scaling with model size.
arXiv Detail & Related papers (2023-05-04T08:11:57Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality (see the sketch after this list).
arXiv Detail & Related papers (2022-02-01T18:58:45Z) - Improving Diversity of Neural Text Generation via Inverse Probability Weighting [43.36560720793425]
We propose a sampling method inspired by inverse probability weighting.
We show that the high-probability part of the distribution might contain tedious or even repetitive candidates that lead to repetition loops.
Results show that our algorithm can effectively increase the diversity of generated samples while achieving close resemblance to human text.
arXiv Detail & Related papers (2021-03-13T08:17:40Z) - A Randomized Algorithm to Reduce the Support of Discrete Measures [79.55586575988292]
Given a discrete probability measure supported on $N$ atoms and a set of $n$ real-valued functions, there exists a probability measure that is supported on a subset of $n+1$ of the original $N$ atoms.
We give a simple geometric characterization of barycenters via negative cones and derive a randomized algorithm that computes this new measure by "greedy geometric sampling".
We then study its properties and benchmark it on synthetic and real-world data to show that it can be very beneficial in the $N \gg n$ regime.
arXiv Detail & Related papers (2020-06-02T16:38:36Z) - A New Minimax Theorem for Randomized Algorithms [1.2284934135116514]
We introduce a new type of minimax theorem which can provide a hard distribution $\mu$ that works for all bias levels at once.
We show that this works for randomized query complexity, randomized communication complexity, approximate degree, and approximate logrank.
We also prove an improved version of Impagliazzo's hardcore lemma.
arXiv Detail & Related papers (2020-02-25T11:46:08Z)
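For reference, the "Typical Decoding for Natural Language Generation" entry above relies on a selection rule that keeps tokens whose surprisal is close to the conditional entropy. Below is a minimal sketch of that rule as commonly described; the threshold value $\tau$ and other details are assumptions, not taken from that paper.

```python
# Sketch of the typical-sampling selection rule referenced in the
# "Typical Decoding for Natural Language Generation" entry: rank tokens by how
# far their surprisal (-log p) is from the conditional entropy and keep the
# closest ones until mass tau is covered. tau and tie-breaking are assumptions.
import numpy as np

def typical_mask(probs: np.ndarray, tau: float = 0.95) -> np.ndarray:
    """Keep tokens whose surprisal is nearest the entropy, up to mass tau."""
    surprisal = -np.log(probs + 1e-12)
    entropy = np.sum(probs * surprisal)              # H of the full distribution
    order = np.argsort(np.abs(surprisal - entropy))  # closest to entropy first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, tau) + 1
    mask = np.zeros_like(probs, dtype=bool)
    mask[order[:cutoff]] = True
    return mask
```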