Accelerating Large Language Model Decoding with Speculative Sampling
- URL: http://arxiv.org/abs/2302.01318v1
- Date: Thu, 2 Feb 2023 18:44:11 GMT
- Title: Accelerating Large Language Model Decoding with Speculative Sampling
- Authors: Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste
Lespiau, Laurent Sifre, John Jumper
- Abstract summary: speculative sampling is an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call.
We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup.
- Score: 9.851546623666588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present speculative sampling, an algorithm for accelerating transformer
decoding by enabling the generation of multiple tokens from each transformer
call. Our algorithm relies on the observation that the latency of parallel
scoring of short continuations, generated by a faster but less powerful draft
model, is comparable to that of sampling a single token from the larger target
model. This is combined with a novel modified rejection sampling scheme which
preserves the distribution of the target model within hardware numerics. We
benchmark speculative sampling with Chinchilla, a 70 billion parameter language
model, achieving a 2-2.5x decoding speedup in a distributed setup, without
compromising the sample quality or making modifications to the model itself.
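For intuition, here is a minimal Python sketch of the procedure the abstract describes, using hypothetical draft_model and target_model callables (the names and interfaces are assumptions, not the paper's API): the draft model proposes K tokens, the target model scores the whole continuation in one call, each draft token is accepted with probability min(1, p(x)/q(x)), and the first rejected position is resampled from the normalized residual max(0, p - q), which is what preserves the target distribution.

```python
import numpy as np

def speculative_sample(prefix, draft_model, target_model, K=4, rng=None):
    """One speculative-sampling step, following the scheme described in the abstract.

    Assumed (hypothetical) interfaces:
      draft_model(tokens)  -> one next-token probability vector for the given context
      target_model(tokens) -> a next-token probability vector for every position in `tokens`
    """
    rng = rng or np.random.default_rng()

    # 1) Draft K tokens autoregressively with the cheap model, remembering q(x).
    drafted, q_probs, ctx = [], [], list(prefix)
    for _ in range(K):
        q = draft_model(ctx)
        x = int(rng.choice(len(q), p=q))
        drafted.append(x)
        q_probs.append(q)
        ctx.append(x)

    # 2) Score prefix + draft with the large model in a single parallel call and keep
    #    the K+1 distributions conditioning on prefix, prefix+x1, ..., prefix+x1..xK.
    p_probs = target_model(list(prefix) + drafted)[-(K + 1):]

    # 3) Modified rejection sampling: accept x_i with probability min(1, p_i(x_i)/q_i(x_i)).
    out = list(prefix)
    for i, x in enumerate(drafted):
        p, q = p_probs[i], q_probs[i]
        if rng.random() < min(1.0, p[x] / max(q[x], 1e-12)):
            out.append(x)
        else:
            # On the first rejection, resample from the normalized residual max(0, p - q);
            # this correction is what keeps the samples distributed as the target model.
            residual = np.maximum(p - q, 0.0)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            return out

    # 4) All K drafts accepted: take one bonus token from the target's final distribution.
    out.append(int(rng.choice(len(p_probs[K]), p=p_probs[K])))
    return out
```

The acceptance rate, and therefore the 2-2.5x speedup reported above, depends on how closely the draft distribution q tracks the target distribution p; the algorithm itself never modifies the target model.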
Related papers
- AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration [0.3626013617212667]
We introduce AMUSD (Asynchronous Multi-device Speculative Decoding), a system that accelerates generation by decoupling the draft and verify phases.
Unlike conventional speculative decoding, where only one model (draft or verify) performs token generation at a time, AMUSD enables both models to perform predictions independently on separate devices.
We evaluate our approach over multiple datasets and show that AMUSD achieves an average 29% improvement over speculative decoding and up to a 1.96$\times$ speedup over conventional autoregressive decoding (a toy sketch of this asynchronous decoupling appears after this list).
arXiv Detail & Related papers (2024-10-22T19:15:35Z)
- Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference [35.730941605490194]
Large language models (LLMs) have shown outstanding performance across numerous real-world tasks.
Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens.
This paper explores the novel integration of speculative decoding with beam sampling.
arXiv Detail & Related papers (2024-09-25T02:20:42Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively with them is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z)
- Fast Inference from Transformers via Speculative Decoding [3.950600027250452]
Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model.
In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel.
arXiv Detail & Related papers (2022-11-30T17:33:28Z)
- Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models [65.52639709094963]
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize.
We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model.
arXiv Detail & Related papers (2022-10-18T22:19:41Z)
- Megapixel Image Generation with Step-Unrolled Denoising Autoencoders [5.145313322824774]
We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy (but perceptually insignificant) compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly.
arXiv Detail & Related papers (2022-06-24T15:47:42Z)
- Cascaded Text Generation with Markov Transformers [122.76100449018061]
Two dominant approaches to neural text generation are fully autoregressive models, using serial beam search decoding, and non-autoregressive models, using parallel decoding with no output dependencies.
This work proposes an autoregressive model with sub-linear parallel time generation. Noting that conditional random fields with bounded context can be decoded in parallel, we propose an efficient cascaded decoding approach for generating high-quality output.
This approach requires only a small modification from standard autoregressive training, while showing competitive accuracy/speed tradeoff compared to existing methods on five machine translation datasets.
arXiv Detail & Related papers (2020-06-01T17:52:15Z)
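As referenced in the AMUSD entry above, the sketch below illustrates (in toy form, not the paper's implementation) what decoupling the draft and verify phases can look like: a draft worker keeps speculating ahead on its own "device" (here, a thread) while a verify worker checks drafted blocks with the target model and rolls the draft stream back on a mismatch. The stand-in models, the greedy verification rule, and the block size are all simplifying assumptions.

```python
import threading

# Toy deterministic stand-ins for the two models (assumptions for this sketch).
def draft_next(tokens):
    return (sum(tokens) * 31 + len(tokens)) % 100

def target_next(tokens):
    # Here the "big" model happens to agree with the draft, so every draft is accepted.
    return (sum(tokens) * 31 + len(tokens)) % 100

lock = threading.Lock()
committed = [1, 2, 3]   # prompt tokens, already "verified"
draft = []              # speculative tokens beyond the committed prefix
done = False

def draft_worker():
    """Keeps speculating ahead, independently of the verifier."""
    while not done:
        with lock:
            ctx = committed + draft
        tok = draft_next(ctx)                   # model work happens outside the lock
        with lock:
            if committed + draft == ctx:        # skip if the verifier rolled us back meanwhile
                draft.append(tok)

def verify_worker(max_new=16, block=4):
    """Verifies drafted blocks with the target model and corrects the first mismatch."""
    global draft, done
    produced = 0
    while produced < max_new:
        with lock:
            ctx, pending = list(committed), draft[:block]
        accepted = []
        for tok in pending:                     # greedy verification: a simplification
            choice = target_next(ctx + accepted)
            accepted.append(choice)
            if choice != tok:
                break
        if not pending:                         # no drafts ready yet: plain autoregressive step
            accepted = [target_next(ctx)]
        with lock:
            committed.extend(accepted)
            # Keep only drafts consistent with what was just committed; otherwise roll back.
            draft = draft[len(accepted):] if draft[:len(accepted)] == accepted else []
        produced += len(accepted)
    done = True

d = threading.Thread(target=draft_worker, daemon=True)
v = threading.Thread(target=verify_worker)
d.start(); v.start(); v.join()
print(committed)
```

Because drafting never waits for verification to finish, the two phases overlap in time; a real system additionally has to manage KV caches and cross-device transfers, which this toy omits.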
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.