Related papers: Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models

Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models

URL: http://arxiv.org/abs/2210.15458v2
Date: Thu, 1 Jun 2023 16:18:51 GMT
Title: Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models
Authors: Luke Vilnis, Yury Zemlyanskiy, Patrick Murray, Alexandre Passos, Sumit Sanghai
Abstract summary: Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model.
Score: 65.52639709094963
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Decoding methods for large language models often trade-off between diversity of outputs and parallelism of computation. Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. Alternatively, methods such as temperature sampling and its modifications (top-k sampling, nucleus sampling, typical decoding, and others), are embarrassingly parallel, but have no guarantees about duplicate samples. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model, compatible with common sampling variations, with provable beam diversity under certain conditions, as well as being embarrassingly parallel and providing unbiased and consistent expectations from the original model. We demonstrate the effectiveness of our approach on WMT machine translation, more than halving the standard deviation when estimating expected BLEU score reward, and closing the BLEU score gap between independent sampling and beam search by up to 63%.

Related papers

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling [59.8051705468084]
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models. We present FR-Spec, a frequency-ranked speculative sampling framework that optimize draft candidate selection through vocabulary space compression.
arXiv Detail & Related papers (2025-02-20T18:58:10Z)
Quasi-random Multi-Sample Inference for Large Language Models [1.647759094903376]
Large language models (LLMs) are often equipped with multi-sample decoding strategies. Traditional text generation methods, such as beam search and sampling-based techniques, have notable limitations. This study explores the potential of arithmetic sampling, contrasting it with ancestral sampling.
arXiv Detail & Related papers (2024-11-09T18:55:04Z)
Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference [35.730941605490194]
Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens. This paper explores the novel integration of speculative decoding with beam sampling.
arXiv Detail & Related papers (2024-09-25T02:20:42Z)
Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation [60.493180081319785]
We propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step. Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
arXiv Detail & Related papers (2024-08-24T14:14:32Z)
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs [4.122612309805664]
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. We propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model's confidence by scaling according to the top token's probability. We conduct extensive experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing, demonstrating that min-p sampling improves both the quality and diversity of generated text, particularly at high temperatures.
arXiv Detail & Related papers (2024-07-01T08:37:25Z)
A Block Metropolis-Hastings Sampler for Controllable Energy-based Text Generation [78.81021361497311]
We develop a novel Metropolis-Hastings (MH) sampler that proposes re-writes of the entire sequence in each step via iterative prompting of a large language model. Our new sampler allows for more efficient and accurate sampling from a target distribution and (b) allows generation length to be determined through the sampling procedure rather than fixed in advance.
arXiv Detail & Related papers (2023-12-07T18:30:15Z)
Conformal Language Modeling [61.94417935386489]
We propose a novel approach to conformal prediction for generative language models (LMs) Standard conformal prediction produces prediction sets with rigorous, statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation.
arXiv Detail & Related papers (2023-06-16T21:55:08Z)
Structured Voronoi Sampling [61.629198273926676]
In this paper, we take an important step toward building a principled approach for sampling from language models with gradient-based methods. We name our gradient-based technique Structured Voronoi Sampling (SVS) In a controlled generation task, SVS is able to generate fluent and diverse samples while following the control targets significantly better than other methods.
arXiv Detail & Related papers (2023-06-05T17:32:35Z)
MacLaSa: Multi-Aspect Controllable Text Generation via Efficient Sampling from Compact Latent Space [110.85888003111653]
Multi-aspect controllable text generation aims to generate fluent sentences that possess multiple desired attributes simultaneously. We introduce a novel approach for multi-aspect control, namely MacLaSa, that estimates compact latent space for multiple aspects. We show that MacLaSa outperforms several strong baselines on attribute relevance and textual quality while maintaining a high inference speed.
arXiv Detail & Related papers (2023-05-22T07:30:35Z)
Accelerating Large Language Model Decoding with Speculative Sampling [9.851546623666588]
speculative sampling is an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup.
arXiv Detail & Related papers (2023-02-02T18:44:11Z)
Ensemble Slice Sampling: Parallel, black-box and gradient-free inference for correlated & multimodal distributions [0.0]
Slice Sampling has emerged as a powerful Markov Chain Monte Carlo algorithm that adapts to the characteristics of the target distribution with minimal hand-tuning. This paper introduces Ensemble Slice Sampling (ESS), a new class of algorithms that bypasses such difficulties by adaptively tuning the initial length scale. These affine-invariant algorithms are trivial to construct, require no hand-tuning, and can easily be implemented in parallel computing environments.
arXiv Detail & Related papers (2020-02-14T19:00:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.