Truncation Sampling as Language Model Desmoothing
- URL: http://arxiv.org/abs/2210.15191v1
- Date: Thu, 27 Oct 2022 05:52:35 GMT
- Title: Truncation Sampling as Language Model Desmoothing
- Authors: John Hewitt, Christopher D. Manning, Percy Liang
- Abstract summary: Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms set some words' probabilities to zero at each step.
We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold.
- Score: 115.28983143361681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long samples of text from neural language models can be of poor quality.
Truncation sampling algorithms, like top-$p$ or top-$k$, address this by
setting some words' probabilities to zero at each step. This work provides
framing for the aim of truncation, and an improved algorithm for that aim. We
propose thinking of a neural language model as a mixture of a true distribution
and a smoothing distribution that avoids infinite perplexity. In this light,
truncation algorithms aim to perform desmoothing, estimating a subset of the
support of the true distribution. Finding a good subset is crucial: we show
that top-$p$ unnecessarily truncates high-probability words, for example
causing it to truncate all words but Trump for a document that starts with
Donald. We introduce $\eta$-sampling, which truncates words below an
entropy-dependent probability threshold. Compared to previous algorithms,
$\eta$-sampling generates more plausible long English documents according to
humans, is better at breaking out of repetition, and behaves more reasonably on
a battery of test distributions.
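As a concrete illustration of the contrast described in the abstract, below is a minimal sketch (not the paper's code) of top-$p$ truncation versus an entropy-dependent threshold in the spirit of $\eta$-sampling. The threshold form $\eta = \min(\epsilon, \sqrt{\epsilon}\,e^{-H})$ follows one reading of the paper; the constants and the toy distribution are illustrative assumptions.

```python
# Sketch only: contrasts top-p truncation with an entropy-dependent threshold
# in the spirit of eta-sampling. The threshold form min(eps, sqrt(eps)*exp(-H))
# and the toy distribution are illustrative assumptions, not the paper's code.
import numpy as np

def top_p_mask(probs: np.ndarray, p: float = 0.95) -> np.ndarray:
    """Keep the smallest set of highest-probability tokens with mass >= p."""
    order = np.argsort(probs)[::-1]              # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # tokens needed to reach mass p
    mask = np.zeros_like(probs, dtype=bool)
    mask[order[:cutoff]] = True
    return mask

def eta_mask(probs: np.ndarray, eps: float = 2e-3) -> np.ndarray:
    """Keep tokens above an entropy-dependent probability threshold."""
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    eta = min(eps, np.sqrt(eps) * np.exp(-entropy))
    return probs > eta

def sample_truncated(probs: np.ndarray, mask: np.ndarray, rng=None) -> int:
    """Zero out truncated tokens, renormalize, and sample one token index."""
    rng = rng or np.random.default_rng()
    kept = np.where(mask, probs, 0.0)
    return int(rng.choice(len(probs), p=kept / kept.sum()))

# Low-entropy toy context (think "Donald" -> "Trump"): one dominant token plus
# a few plausible alternatives. Top-p keeps only the dominant token, while the
# entropy-dependent threshold also keeps the plausible alternatives.
probs = np.concatenate(([0.96], np.full(3, 0.01), np.full(96, 0.01 / 96)))
print(top_p_mask(probs).sum(), eta_mask(probs).sum())  # e.g. 1 vs 4 tokens kept
next_token = sample_truncated(probs, eta_mask(probs))
```

The toy numbers mirror the abstract's "Donald ... Trump" example: under a confident distribution, top-$p$ discards every alternative continuation, while the entropy-scaled threshold retains the handful of plausible ones.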
Related papers
- SpecTr: Fast Speculative Decoding via Optimal Transport [30.18181671899423]
We develop a new autoregressive sampling algorithm called SpecTr, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output.
We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
arXiv Detail & Related papers (2023-10-23T17:47:34Z) - Closing the Curious Case of Neural Text Degeneration [91.22954750742183]
We provide a theoretical explanation for the effectiveness of truncation sampling.
We show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability.
Our evaluations show that our method outperforms its threshold-based counterparts for low-entropy text generation.
arXiv Detail & Related papers (2023-10-02T23:16:25Z) - Conformal Nucleus Sampling [67.5232384936661]
We assess whether a top-$p$ set is indeed aligned with its probabilistic meaning in various linguistic contexts.
We find that OPT models are overconfident, and that calibration shows a moderate inverse scaling with model size.
arXiv Detail & Related papers (2023-05-04T08:11:57Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality (see the sketch after this list).
arXiv Detail & Related papers (2022-02-01T18:58:45Z) - Improving Diversity of Neural Text Generation via Inverse Probability Weighting [43.36560720793425]
We propose a sampling method inspired by inverse probability weighting.
We show that the high-probability part of the distribution might contain tedious or even repetitive candidates that lead to repetition loops.
Results show that our algorithm can effectively increase the diversity of generated samples while achieving close resemblance to human text.
arXiv Detail & Related papers (2021-03-13T08:17:40Z) - A Randomized Algorithm to Reduce the Support of Discrete Measures [79.55586575988292]
Given a discrete probability measure supported on $N$ atoms and a set of $n$ real-valued functions, there exists a probability measure that is supported on a subset of $n+1$ of the original $N$ atoms.
We give a simple geometric characterization of barycenters via negative cones and derive a randomized algorithm that computes this new measure by "greedy geometric sampling".
We then study its properties and benchmark it on synthetic and real-world data to show that it can be very beneficial in the $N \gg n$ regime.
arXiv Detail & Related papers (2020-06-02T16:38:36Z) - A New Minimax Theorem for Randomized Algorithms [1.2284934135116514]
We introduce a new type of minimax theorem which can provide a hard distribution $\mu$ that works for all bias levels at once.
We show that this works for randomized query complexity, randomized communication complexity, approximate degree, and approximate logrank.
We also prove an improved version of Impagliazzo's hardcore lemma.
arXiv Detail & Related papers (2020-02-25T11:46:08Z)
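For reference, the "Typical Decoding for Natural Language Generation" entry above relies on a selection rule that keeps tokens whose surprisal is close to the conditional entropy. Below is a minimal sketch of that rule as commonly described; the threshold value $\tau$ and other details are assumptions, not taken from that paper.

```python
# Sketch of the typical-sampling selection rule referenced in the
# "Typical Decoding for Natural Language Generation" entry: rank tokens by how
# far their surprisal (-log p) is from the conditional entropy and keep the
# closest ones until mass tau is covered. tau and tie-breaking are assumptions.
import numpy as np

def typical_mask(probs: np.ndarray, tau: float = 0.95) -> np.ndarray:
    """Keep tokens whose surprisal is nearest the entropy, up to mass tau."""
    surprisal = -np.log(probs + 1e-12)
    entropy = np.sum(probs * surprisal)              # H of the full distribution
    order = np.argsort(np.abs(surprisal - entropy))  # closest to entropy first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, tau) + 1
    mask = np.zeros_like(probs, dtype=bool)
    mask[order[:cutoff]] = True
    return mask
```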