Mirostat: A Neural Text Decoding Algorithm that Directly Controls
Perplexity
- URL: http://arxiv.org/abs/2007.14966v2
- Date: Thu, 14 Jan 2021 21:36:02 GMT
- Title: Mirostat: A Neural Text Decoding Algorithm that Directly Controls
Perplexity
- Authors: Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish
Keskar, Lav R. Varshney
- Abstract summary: We use a theoretical analysis of perplexity in top-k, top-p, and temperature sampling to design a feedback-based adaptive top-k text decoding algorithm called mirostat.
Experiments show that for low values of k and p in top-k and top-p sampling, perplexity drops significantly with generated text length.
For large values of k and p, perplexity increases with generated text length, which is correlated with incoherence in the text.
- Score: 22.15683400807154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural text decoding is important for generating high-quality texts using
language models. To generate high-quality text, popular decoding algorithms
like top-k, top-p (nucleus), and temperature-based sampling truncate or distort
the unreliable low probability tail of the language model. Though these methods
generate high-quality text after parameter tuning, they are ad hoc. Not much is
known about the control they provide over the statistics of the output, which
is important since recent reports show text quality is highest for a specific
range of likelihoods. Here, first we provide a theoretical analysis of
perplexity in top-k, top-p, and temperature sampling, finding that
cross-entropy behaves approximately linearly as a function of p in top-p
sampling whereas it is a nonlinear function of k in top-k sampling, under
Zipfian statistics. We use this analysis to design a feedback-based adaptive
top-k text decoding algorithm called mirostat that generates text (of any
length) with a predetermined value of perplexity, and thereby high-quality text
without any tuning. Experiments show that for low values of k and p in top-k
and top-p sampling, perplexity drops significantly with generated text length,
which is also correlated with excessive repetitions in the text (the boredom
trap). On the other hand, for large values of k and p, we find that perplexity
increases with generated text length, which is correlated with incoherence in
the text (confusion trap). Mirostat avoids both traps: experiments show that
cross-entropy has a near-linear relation with repetition in generated text.
This relation is almost independent of the sampling method but slightly
dependent on the model used. Hence, for a given language model, control over
perplexity also gives control over repetitions. Experiments with human raters
for fluency, coherence, and quality further verify our findings.
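The feedback loop the abstract describes can be made concrete with a short sketch. Below is a minimal, illustrative Python implementation of a feedback-controlled top-k sampling step: it keeps tokens whose surprise (negative log-probability) is under a threshold mu, samples from that truncated set, and nudges mu toward a target surprise tau. The function name, its defaults, and the direct surprise-threshold choice of k are assumptions for illustration; the paper's actual algorithm additionally estimates a Zipf exponent from the top probabilities to compute k.

```python
import numpy as np

def mirostat_step(logits, mu, tau, eta=0.1, rng=None):
    """One step of a feedback-controlled top-k sampler (illustrative
    simplification; the paper's mirostat estimates a Zipf exponent
    to set k, whereas here k comes directly from the threshold mu)."""
    rng = rng or np.random.default_rng()

    # Numerically stable softmax over the vocabulary.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]      # token ids, most probable first
    surprise = -np.log2(probs[order])    # per-token surprise in bits

    # Adaptive truncation: keep every token whose surprise is below mu.
    k = max(1, int(np.sum(surprise < mu)))
    top_p = probs[order[:k]] / probs[order[:k]].sum()

    # Sample from the truncated, renormalized distribution.
    idx = rng.choice(k, p=top_p)
    observed = surprise[idx]

    # Feedback: move mu so the observed surprise tracks the target tau.
    mu -= eta * (observed - tau)
    return order[idx], mu
```

Calling this once per generated token, with mu initialized to roughly 2 * tau, keeps the average per-token surprise (and hence the perplexity, 2^tau in bits) near the target without hand-tuning k or p, which is the mechanism by which mirostat avoids both the boredom and confusion traps.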
Related papers
- LRANet: Towards Accurate and Efficient Scene Text Detection with
Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z) - DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection [56.513637720967566]
Large language models (LLMs) can generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets.
Existing high-quality detection methods usually require access to the interior of the model to extract the intrinsic characteristics.
We propose to extract deep intrinsic characteristics of texts generated by black-box models.
arXiv Detail & Related papers (2023-05-21T17:26:16Z) - On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z) - TextDCT: Arbitrary-Shaped Text Detection via Discrete Cosine Transform
Mask [19.269070203448187]
Arbitrary-shaped scene text detection is a challenging task due to variations in text font, size, color, and orientation.
We propose a novel light-weight anchor-free text detection framework called TextDCT, which adopts the discrete cosine transform (DCT) to encode the text masks as compact vectors.
TextDCT achieves F-measure of 85.1 at 17.2 frames per second (FPS) and F-measure of 84.9 at 15.1 FPS for CTW1500 and Total-Text datasets, respectively.
arXiv Detail & Related papers (2022-06-27T15:42:25Z) - A Fast Randomized Algorithm for Massive Text Normalization [26.602776972067936]
We present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data.
Our algorithm relies on the Jaccard similarity between words to suggest correction results.
Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.
arXiv Detail & Related papers (2021-10-06T19:18:17Z) - Bidirectional Regression for Arbitrary-Shaped Text Detection [16.30976392505236]
This paper presents a novel text instance expression which integrates both foreground and background information into the pipeline.
A corresponding post-processing algorithm is also designed to sequentially combine the four prediction results and reconstruct the text instance accurately.
We evaluate our method on several challenging scene text benchmarks, including both curved and multi-oriented text datasets.
arXiv Detail & Related papers (2021-07-13T14:29:09Z) - BOTD: Bold Outline Text Detector [85.33700624095181]
We propose a new one-stage text detector, termed Bold Outline Text Detector (BOTD).
BOTD is able to process the arbitrary-shaped text with low model complexity.
Experimental results on three real-world benchmarks show the state-of-the-art performance of BOTD.
arXiv Detail & Related papers (2020-11-30T11:54:14Z) - Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z) - Investigating Label Bias in Beam Search for Open-ended Text Generation [8.331919991368366]
In open-ended text generation, beam search is often found to produce repetitive and generic texts.
Standard seq2seq models suffer from label bias due to their locally normalized probability formulation.
By combining locally normalized maximum likelihood estimation and globally normalized sequence-level training, label bias can be reduced with almost no sacrifice in perplexity.
arXiv Detail & Related papers (2020-05-22T05:17:53Z) - Sparse Text Generation [7.747003493657217]
Current text generators require sampling from a modified softmax, via temperature parameters or ad-hoc truncation techniques, as in top-$k$ or nucleus sampling.
In this paper, we use the recently introduced entmax transformation to train and sample from a sparse language model, avoiding this mismatch.
The result is a text generator with favorable performance in terms of fluency and consistency, fewer repetitions, and n-gram diversity closer to human text.
arXiv Detail & Related papers (2020-04-06T13:09:10Z) - Generating diverse and natural text-to-speech samples using a quantized
fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.