Speculative Decoding: Exploiting Speculative Execution for Accelerating
Seq2seq Generation
- URL: http://arxiv.org/abs/2203.16487v6
- Date: Mon, 30 Oct 2023 01:36:06 GMT
- Title: Speculative Decoding: Exploiting Speculative Execution for Accelerating
Seq2seq Generation
- Authors: Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, Zhifang Sui
- Abstract summary: We propose Speculative Decoding (SpecDec), which exploits the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter -- an independent model specially optimized for efficient drafting, and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently.
- Score: 80.2267931231335
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Speculative Decoding (SpecDec), for the first time ever, to
formally study exploiting the idea of speculative execution to accelerate
autoregressive (AR) decoding. Speculative Decoding has two innovations:
Spec-Drafter -- an independent model specially optimized for efficient and
accurate drafting -- and Spec-Verification -- a reliable method for verifying
the drafted tokens efficiently in the decoding paradigm. Experimental results
on various seq2seq tasks including machine translation and abstractive
summarization show our approach can achieve around $5\times$ speedup for the
popular Transformer architectures with comparable generation quality to beam
search decoding, refreshing the impression that the draft-then-verify paradigm
introduces only $1.4\times\sim2\times$ speedup. In addition to the
remarkable speedup, we also demonstrate 3 additional advantages of SpecDec,
revealing its practical value for accelerating generative models in real-world
applications. Our models and codes are available at
https://github.com/hemingkx/SpecDec.
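
The core loop behind SpecDec is easy to state: the drafter proposes a block of tokens, and the target model scores the whole block in a single forward pass, accepting the longest verified prefix. Below is a minimal sketch of that draft-then-verify loop; `drafter.propose` and the `verifier` call are hypothetical interfaces, and the top-beta acceptance test is a simplified stand-in for the paper's Spec-Verification criterion, not its exact formulation.

```python
import torch

@torch.no_grad()
def spec_decode(drafter, verifier, prefix, k=5, top_beta=3, max_len=128):
    """Draft k tokens at a time, then check them in one target-model pass."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        # 1) Drafting: cheaply propose the next k tokens.
        draft = list(drafter.propose(tokens, k))          # k token ids
        # 2) Verification: a single forward pass of the target model scores
        #    all k drafted positions at once, instead of k sequential steps.
        logits = verifier(torch.tensor([tokens + draft])) # [1, T+k, vocab]
        accepted = []
        for i, tok in enumerate(draft):
            pos = len(tokens) + i - 1                     # logits predicting draft[i]
            top = logits[0, pos].topk(top_beta).indices.tolist()
            if tok in top:                                # relaxed acceptance
                accepted.append(tok)
            else:
                # First rejection: keep the target model's own top token
                # for this position and re-draft from there.
                accepted.append(int(logits[0, pos].argmax()))
                break
        tokens += accepted                                # >= 1 token per iteration
    return tokens[:max_len]
```

Because every iteration commits at least one token, throughput depends on how many drafted tokens survive verification, which is why the paper invests in an accurate, independently optimized Spec-Drafter.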
Related papers
- SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference [9.143856130336783]
SuffixDecoding is a model-free approach to accelerating large language model (LLM) inference through speculative decoding.
Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models.
For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than speculative decoding.
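
As a rough illustration of how speculation can work without any draft model, the sketch below indexes previously generated token streams by their n-gram suffixes and proposes continuations by lookup; the flat table and match policy are illustrative assumptions, not the paper's actual suffix-tree implementation.

```python
from collections import defaultdict

class SuffixDrafter:
    """Toy model-free drafter: reuse continuations seen in past generations."""
    def __init__(self, ngram=4):
        self.ngram = ngram
        self.table = defaultdict(list)   # suffix n-gram -> observed continuations

    def observe(self, tokens):
        """Index a finished generation so future requests can reuse it."""
        n = self.ngram
        for i in range(len(tokens) - n):
            self.table[tuple(tokens[i:i + n])].append(tokens[i + n:i + n + 8])

    def propose(self, context, k):
        """Return up to k draft tokens by suffix lookup; empty if no match."""
        conts = self.table.get(tuple(context[-self.ngram:]))
        return conts[-1][:k] if conts else []    # most recent match wins
```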
arXiv Detail & Related papers (2024-11-07T18:49:33Z)
- AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration [0.3626013617212667]
We introduce AMUSD (Asynchronous Multi-device Speculative Decoding), a system that accelerates generation by decoupling the draft and verify phases.
Unlike conventional speculative decoding, where only one model (draft or verify) performs token generation at a time, AMUSD enables both models to perform predictions independently on separate devices.
We evaluate our approach over multiple datasets and show that AMUSD achieves an average 29% improvement over speculative decoding and up to 1.96$\times$ speedup over conventional autoregressive decoding.
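
A rough sketch of the asynchrony AMUSD describes, with the drafter and verifier running as concurrent loops (standing in for separate devices); `drafter.next_token` and `verifier.check` are hypothetical interfaces, and rollback is simplified to truncating the unverified tail.

```python
import threading

def amusd_generate(drafter, verifier, prefix, max_len=128):
    tokens, lock, done = list(prefix), threading.Lock(), threading.Event()

    def draft_loop():                                    # draft device
        while not done.is_set():
            with lock:
                ctx = list(tokens)
            nxt = drafter.next_token(ctx)                # cheap draft model
            with lock:
                if tokens == ctx:                        # context unchanged: extend it
                    tokens.append(nxt)

    def verify_loop():                                   # target device
        verified = len(prefix)
        while verified < max_len:
            with lock:
                pending = tokens[verified:]
            if not pending:
                continue                                 # nothing drafted yet
            # check() returns how many pending tokens are accepted, plus the
            # target model's token at the first mismatch (or the next token).
            n_ok, fix = verifier.check(tokens[:verified], pending)
            with lock:
                tokens[verified + n_ok:] = [fix]         # roll back rejected tail
            verified += n_ok + 1                         # always advances >= 1
        done.set()

    worker = threading.Thread(target=draft_loop)
    worker.start()
    verify_loop()
    worker.join()
    return tokens[:max_len]
```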
arXiv Detail & Related papers (2024-10-22T19:15:35Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
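
A toy sketch of what swapping a sequential drafter for a diffusion drafter could look like: a discrete denoiser refines a fully masked block of k tokens over a few iterations, so the whole draft emerges in parallel. The `denoiser` interface and the confidence-based unmasking schedule are assumptions in the style of discrete diffusion samplers, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def diffusion_draft(denoiser, context, k=8, steps=4, mask_id=0):
    """Draft k tokens in parallel by iteratively unmasking a masked block."""
    draft = torch.full((k,), mask_id)                    # start fully masked
    for _ in range(steps):
        logits = denoiser(context, draft)                # [k, vocab]
        conf, pred = logits.softmax(-1).max(-1)          # per-position confidence
        keep = conf >= conf.median()                     # commit confident positions
        draft = torch.where(keep, pred, torch.full_like(pred, mask_id))
    return denoiser(context, draft).argmax(-1).tolist()  # fill any leftovers
```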
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose Chimera, a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of $2.7\times$ compared to the vanilla auto-regressive decoding approach.
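
The summary above is light on detail, so the sketch below only illustrates the general shape of a lightweight drafter that fuses all previously generated tokens into one cheap prediction; the mean-pooling fusion is an illustrative stand-in for Chimera's actual architecture.

```python
import torch.nn as nn

class FusedTokenDrafter(nn.Module):
    """Lightweight drafter conditioned on all previously generated tokens."""
    def __init__(self, vocab, d_model=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, vocab))

    def forward(self, token_ids):                 # [B, T] generated so far
        ctx = self.emb(token_ids).mean(dim=1)     # fuse all tokens -> [B, d_model]
        return self.mlp(ctx)                      # next-token logits, [B, vocab]
```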
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
- Speculative Streaming: Fast LLM Inference without Auxiliary Models [21.454206732725563]
Speculative Streaming is a single-model speculative decoding method.
It fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction.
It speeds up decoding by $1.8\times$ to $3.1\times$ across a diverse set of tasks.
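
A generic sketch of the single-model idea: the target network gains extra output heads trained with a future n-gram objective, so at inference it drafts several tokens ahead of itself and then verifies them with its ordinary next-token head. The head design below is a plain stand-in, not the paper's multi-stream attention.

```python
import torch.nn as nn

class NGramHeads(nn.Module):
    """Extra output heads so one model can draft several tokens ahead."""
    def __init__(self, trunk, d_model, vocab, horizon=4):
        super().__init__()
        self.trunk = trunk                                # base decoder stack
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab) for _ in range(horizon)
        )                                                 # head j predicts token t+1+j

    def forward(self, input_ids):
        h = self.trunk(input_ids)                         # [B, T, d_model]
        return [head(h) for head in self.heads]           # horizon x [B, T, vocab]
```

At inference time, the argmax of each head at the last position forms the draft block, which is then verified with the model's standard next-token distribution, much like the draft-then-verify loop sketched earlier.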
arXiv Detail & Related papers (2024-02-16T23:36:43Z)
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM, while CaPE expands the set of proposed candidate tokens using the draft model's confidence scores.
We will release our code, data, and the trained draft models.
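
A conceptual sketch of the GliDe idea, assuming the draft block cross-attends to representations the target LLM has already cached for the same prefix (passed in here as `target_kv`); layer norms and the exact reuse of key/value tensors are omitted for brevity.

```python
import torch.nn as nn

class GlideDraftBlock(nn.Module):
    """Draft-model block that attends over the target model's cached states."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, target_kv):
        # target_kv: states the big model already computed for this prefix,
        # so the drafter sees target-quality context almost for free.
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        x = x + self.cross_attn(x, target_kv, target_kv, need_weights=False)[0]
        return x + self.ffn(x)
```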
arXiv Detail & Related papers (2024-02-03T08:44:11Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
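
A minimal sketch of the distillation objective: minimize a divergence between the drafter's and the target's next-token distributions so that more drafted tokens survive verification. Forward KL at temperature t is shown as one plausible choice; the paper studies several divergence and training-data combinations.

```python
import torch.nn.functional as F

def distill_loss(draft_logits, target_logits, temperature=1.0):
    """Forward KL between target (teacher) and draft (student); [B, T, V] inputs."""
    t = temperature
    teacher = F.softmax(target_logits / t, dim=-1)
    student = F.log_softmax(draft_logits / t, dim=-1)
    # F.kl_div(log_q, p) computes KL(p || q); flatten so 'batchmean'
    # averages over every (batch, position) pair.
    return F.kl_div(student.flatten(0, 1), teacher.flatten(0, 1),
                    reduction="batchmean") * (t * t)
```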
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which, in contrast to input-side prompt tuning, optimizes task-specific decoder networks on the output side.
Through gradient-based optimization, DecT can be trained within several seconds and requires only one PTM (pre-trained model) query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
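
The recipe the summary describes fits in a few lines: query the frozen PTM once per training sample, cache the representations, and fit a small head on top. In this sketch `ptm.encode` is a hypothetical feature extractor and the linear head is a simplification of DecT's decoder network.

```python
import torch
import torch.nn as nn

def decoder_tune(ptm, texts, labels, d_model, n_classes, epochs=20):
    with torch.no_grad():                         # one PTM query per sample
        feats = torch.stack([ptm.encode(t) for t in texts])   # [N, d_model]
    head = nn.Linear(d_model, n_classes)          # the only trainable part
    opt = torch.optim.Adam(head.parameters(), lr=1e-2)
    y = torch.tensor(labels)
    for _ in range(epochs):                       # seconds, not hours
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(feats), y)
        loss.backward()
        opt.step()
    return head
```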
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- Fast Interleaved Bidirectional Sequence Generation [90.58793284654692]
We introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously.
We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder.
Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer.
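
The target-side reordering that makes this possible is simple to illustrate: interleave the sequence from both ends so a single left-to-right pass emits tokens from both directions, roughly halving the number of sequential steps. A toy version of that transform:

```python
def interleave(tokens):
    """[y1..yN] -> [y1, yN, y2, yN-1, ...]"""
    out, i, j = [], 0, len(tokens) - 1
    while i <= j:
        out.append(tokens[i])
        if i != j:                    # middle element appears only once
            out.append(tokens[j])
        i, j = i + 1, j - 1
    return out

assert interleave([1, 2, 3, 4, 5]) == [1, 5, 2, 4, 3]
```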
arXiv Detail & Related papers (2020-10-27T17:38:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown (including all generated summaries) and is not responsible for any consequences arising from its use.