Speculative Decoding: Exploiting Speculative Execution for Accelerating
Seq2seq Generation
- URL: http://arxiv.org/abs/2203.16487v6
- Date: Mon, 30 Oct 2023 01:36:06 GMT
- Title: Speculative Decoding: Exploiting Speculative Execution for Accelerating
Seq2seq Generation
- Authors: Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, Zhifang Sui
- Abstract summary: We propose Speculative Decoding (SpecDec) to study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter -- an independent model specially optimized for efficient drafting, and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently.
- Score: 80.2267931231335
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Speculative Decoding (SpecDec), for the first time ever, to
formally study exploiting the idea of speculative execution to accelerate
autoregressive (AR) decoding. Speculative Decoding has two innovations:
Spec-Drafter -- an independent model specially optimized for efficient and
accurate drafting -- and Spec-Verification -- a reliable method for verifying
the drafted tokens efficiently in the decoding paradigm. Experimental results
on various seq2seq tasks including machine translation and abstractive
summarization show our approach can achieve around $5\times$ speedup for the
popular Transformer architectures with comparable generation quality to beam
search decoding, refreshing the impression that the draft-then-verify paradigm
introduces only $1.4\times$$\sim$$2\times$ speedup. In addition to the
remarkable speedup, we also demonstrate 3 additional advantages of SpecDec,
revealing its practical value for accelerating generative models in real-world
applications. Our models and codes are available at
https://github.com/hemingkx/SpecDec.
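The draft-then-verify paradigm described in the abstract can be sketched in a few lines. Below is a minimal toy illustration, not the paper's actual Transformer implementation: `toy_draft` and `toy_target` are hypothetical stand-in models over integer tokens, and acceptance uses simple greedy prefix matching.

```python
def draft_then_verify(prefix, draft_model, target_model, k=4, max_len=12):
    """Generate tokens by drafting k at a time and keeping the verified prefix."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        # 1) Draft: the small model proposes k candidate tokens.
        draft = draft_model(tokens, k)
        # 2) Verify: one pass of the large model scores every draft position.
        verified = target_model(tokens, draft)
        # 3) Accept the longest prefix where draft and target agree.
        n_accept = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            n_accept += 1
        tokens.extend(draft[:n_accept])
        # On a mismatch, take the target model's own token so decoding progresses.
        if n_accept < len(draft):
            tokens.append(verified[n_accept])
    return tokens[:max_len]

def toy_target(tokens, draft):
    # Toy "large" model: its greedy pick at each position is previous token + 1,
    # scored in parallel while conditioning on the drafted tokens.
    ctx, out = list(tokens), []
    for d in draft:
        out.append(ctx[-1] + 1)
        ctx.append(d)
    return out

def toy_draft(tokens, k):
    # Toy drafter: proposes prev + 1 but deliberately flubs the third slot,
    # so only a prefix of each draft block survives verification.
    out, prev = [], tokens[-1]
    for i in range(k):
        nxt = 99 if i == 2 else prev + 1
        out.append(nxt)
        prev = nxt
    return out
```

With these toys, each round commits two accepted draft tokens plus one corrected target token, so the target model runs far fewer times than the number of tokens generated.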
Related papers
- Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference [41.93955876156331]
Large language models (LLMs) have demonstrated their power in various tasks, but their inference incurs significant time and energy costs.
Speculative decoding uses a smaller model to propose a sequence of tokens, which are subsequently validated in batch by the target large model.
Compared with autoregressive decoding, speculative decoding generates the same number of tokens with fewer runs of the large model.
An algorithm with better output perplexity and even better efficiency than speculative decoding would be more useful in practice.
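The "same number of tokens with fewer runs of the large model" claim above has a simple back-of-envelope form. The sketch below assumes each drafted token is accepted independently with probability p, a simplifying assumption rather than any of these papers' exact analyses.

```python
def tokens_per_target_run(k, p):
    """Estimated tokens committed per large-model run with draft length k,
    assuming i.i.d. per-token acceptance probability p (an idealization).
    The geometric series 1 + p + p^2 + ... + p^k counts the accepted
    prefix plus the target model's own corrected token."""
    return sum(p ** i for i in range(k + 1))
```

At p = 0 this degrades to one token per run (ordinary autoregressive decoding); at p = 1 every run commits k + 1 tokens, which is where the large speedups come from.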
arXiv Detail & Related papers (2024-07-12T23:29:54Z)
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput [32.479057822334354]
Speculative decoding is one of the most effective techniques for accelerating large language model inference.
We develop a dynamic framework SmartSpec to determine the best speculation length for each request.
We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines.
arXiv Detail & Related papers (2024-06-20T07:43:33Z)
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose Chimera, a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
- Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to $2.4\times$ over speculative decoding and $3.9\times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
- Speculative Streaming: Fast LLM Inference without Auxiliary Models [21.454206732725563]
Speculative Streaming is a single-model speculative decoding method.
It fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction.
It speeds up decoding by 1.8 - 3.1X in a diverse set of tasks.
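The future n-gram prediction objective above can be pictured as replacing the single next-token projection with one output head per lookahead offset, so a single forward pass yields a multi-token draft. The sketch below is purely illustrative: the names, shapes, and random weights are assumptions, not the paper's implementation.

```python
import random

random.seed(0)
d_model, vocab, n = 8, 16, 3
# One hypothetical d_model x vocab projection head per lookahead offset.
heads = [[[random.gauss(0, 1) for _ in range(vocab)] for _ in range(d_model)]
         for _ in range(n)]

def draft_ngram(hidden):
    """Greedy n-token draft: argmax of logits = hidden @ W for each head,
    all computed from the same hidden state in one pass."""
    out = []
    for w in heads:
        logits = [sum(h * w[i][v] for i, h in enumerate(hidden))
                  for v in range(vocab)]
        out.append(max(range(vocab), key=logits.__getitem__))
    return out

draft = draft_ngram([0.1] * d_model)  # a 3-token draft from one forward pass
```

Because the drafting heads live inside the target model itself, no auxiliary draft model needs to be trained or served.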
arXiv Detail & Related papers (2024-02-16T23:36:43Z)
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
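Aligning the draft model with the target via distillation amounts to minimizing a divergence between their per-token distributions. The sketch below shows forward KL as one standard choice; it is an illustrative instance, not necessarily the specific objective DistillSpec settles on.

```python
import math

def kd_loss(p_target, q_draft, eps=1e-12):
    """Forward KL(p_target || q_draft) over one token's vocabulary
    distribution. Minimizing it pulls the draft model's distribution
    toward the target's, raising the chance drafted tokens are accepted."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_target, q_draft))
```

A perfectly aligned draft incurs zero loss; any mismatch is penalized, which directly translates into longer accepted prefixes during verification.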
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one query to the pre-trained model per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- Fast Interleaved Bidirectional Sequence Generation [90.58793284654692]
We introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously.
We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder.
Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer.
arXiv Detail & Related papers (2020-10-27T17:38:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.