GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative
Decoding
- URL: http://arxiv.org/abs/2402.02082v1
- Date: Sat, 3 Feb 2024 08:44:11 GMT
- Title: GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative
Decoding
- Authors: Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li,
Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You
- Abstract summary: We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
- Score: 81.01996600734616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding is a relatively new decoding framework that leverages
small and efficient draft models to reduce the latency of LLMs. In this study,
we introduce GliDe and CaPE, two low-hassle modifications to vanilla
speculative decoding to further improve the decoding speed of a frozen LLM.
Specifically, GliDe is a modified draft model architecture that reuses the
cached keys and values from the target LLM, while CaPE is a proposal expansion
method that uses the draft model's confidence scores to help select additional
candidate tokens for verification. Extensive experiments on different
benchmarks demonstrate that our proposed GliDe draft model significantly
reduces the expected decoding latency. Additional evaluation of wall-clock time
shows that GliDe accelerates Vicuna models by up to 2.17x, and CaPE extends the
improvement to 2.61x. We will release our code, data, and the trained draft
models.
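The abstract describes GliDe only at a high level: a draft model that reuses the keys and values already cached by the frozen target LLM. The block below is a minimal PyTorch sketch of that idea, not the paper's implementation; for simplicity it cross-attends over the target's cached hidden states rather than its per-layer key/value tensors, and every module name and dimension here is illustrative.

```python
import torch
import torch.nn as nn

class GlideStyleDraftBlock(nn.Module):
    """One draft-model block: self-attention over the drafted tokens plus
    cross-attention over representations cached by the frozen target LLM."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention consumes the *target* model's cached states, so the
        # draft model "glides" on work the target has already done.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x: torch.Tensor, target_cache: torch.Tensor) -> torch.Tensor:
        # x: (batch, draft_len, d_model) hidden states of the drafted tokens.
        # target_cache: (batch, ctx_len, d_model) states the target LLM already
        # computed for the verified context, reused here at no extra cost.
        causal = torch.triu(
            torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1
        )
        h = self.ln1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal)[0]
        h = self.ln2(x)
        x = x + self.cross_attn(h, target_cache, target_cache)[0]
        return x + self.mlp(self.ln3(x))

block = GlideStyleDraftBlock()
out = block(torch.randn(1, 4, 256), torch.randn(1, 32, 256))
```

Because the context representations come from the target's cache, the draft model skips recomputing them and can stay small while still conditioning on the full verified prefix.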
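CaPE is likewise described in one line: use the draft model's confidence scores to select additional candidate tokens for verification. A minimal sketch of that idea follows; the confidence threshold and candidate budget are illustrative knobs, not values from the paper.

```python
import torch

def cape_expand(draft_logits: torch.Tensor,
                confidence_threshold: float = 0.7,  # illustrative value
                extra_candidates: int = 3) -> list:  # illustrative budget
    """draft_logits: (draft_len, vocab_size) logits for each drafted position.
    Returns a list of candidate token ids per position: the greedy token alone
    where the draft model is confident, plus extra candidates where it is not."""
    probs = torch.softmax(draft_logits, dim=-1)
    top_probs, top_ids = probs.topk(1 + extra_candidates, dim=-1)
    proposals = []
    for t in range(draft_logits.size(0)):
        if top_probs[t, 0].item() >= confidence_threshold:
            proposals.append(top_ids[t, :1].tolist())  # confident: one proposal
        else:
            proposals.append(top_ids[t].tolist())      # unsure: expand proposals
    return proposals

# Toy usage: 5 drafted positions over a vocabulary of 32 tokens.
proposals = cape_expand(torch.randn(5, 32))
```

The target LLM can then score all proposed candidates in a single batched forward pass and accept the longest prefix that passes verification, so the extra candidates cost little additional wall-clock time.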
Related papers
- Efficient Inference for Large Language Model-based Generative Recommendation [78.38878421030522]
Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly.
Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items.
We propose an alignment framework named AtSpeed, whose AtSpeed-S optimization objective targets top-K alignment under strict top-K verification.
arXiv Detail & Related papers (2024-10-07T16:23:36Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.
This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an approach that uses a directed acyclic graph (DAG) to manage the drafted hypotheses; a prefix-merging sketch follows this list.
We observe speedups of 1.73x to 1.96x, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
- Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs [11.245862832561176]
Training a high-quality draft model is required to enable inference acceleration via speculative decoding.
We train Llama 2 Chat Drafter 115M, a draft model for Llama 2 Chat 7B or larger, at only 1.64% of the original model's size.
Our results show that Llama 2 Chat Drafter 115M with speculative decoding achieves a block efficiency of up to 2.3 and a 2.4x speed-up.
arXiv Detail & Related papers (2024-02-29T19:55:06Z)
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
It achieves an average latency speedup of 2.7x over vanilla auto-regressive decoding.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
- Cascade Speculative Drafting for Even Faster LLM Inference [25.642604897018852]
Speculative decoding improves the efficiency of large language model (LLM) inference.
We introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades.
CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments.
arXiv Detail & Related papers (2023-12-18T18:59:46Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec, which uses knowledge distillation to better align the draft model with the target model before applying SD; a generic distillation loss is sketched after this list.
We show that DistillSpec yields 10-45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation [80.2267931231335]
We propose Speculative Decoding (SpecDec), which exploits the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter, an independent model specially optimized for efficient drafting, and Spec-Verification, a reliable method for verifying the drafted tokens efficiently; a minimal verification loop is sketched after this list.
arXiv Detail & Related papers (2022-03-30T17:27:09Z)
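The DAG-managed draft from Graph-Structured Speculative Decoding can be pictured as merging hypotheses that share token prefixes. The trie-based sketch below (a trie is a special case of a DAG) shows only this general prefix-merging idea, not the paper's exact construction.

```python
from collections import defaultdict

def build_draft_dag(hypotheses):
    """hypotheses: list of drafted token-id sequences.
    Merges shared prefixes so each unique (prefix, token) edge is
    scored by the target model only once."""
    children = defaultdict(dict)  # node id -> {token id: child node id}
    next_node = 0                 # node 0 is the root (the verified context)
    for seq in hypotheses:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                next_node += 1
                children[node][tok] = next_node
            node = children[node][tok]
    return children

# Three hypotheses sharing the prefix [5, 9] produce 5 edges, not 8.
edges = build_draft_dag([[5, 9, 2], [5, 9, 7], [5, 1]])
```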
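For DistillSpec, a generic white-box distillation objective conveys what "aligning the draft model with the target model" means. DistillSpec itself studies several divergences and training-data constructions, so the forward-KL loss below is just one common instance, with illustrative names throughout.

```python
import torch
import torch.nn.functional as F

def distill_loss(draft_logits: torch.Tensor,
                 target_logits: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Both tensors: (batch, seq_len, vocab_size). Returns the forward KL
    KL(target || draft), which rewards the draft for covering the target's
    probability mass and hence raises draft-token acceptance rates."""
    t = temperature
    target_logp = F.log_softmax(target_logits / t, dim=-1)
    draft_logp = F.log_softmax(draft_logits / t, dim=-1)
    return F.kl_div(draft_logp, target_logp, log_target=True,
                    reduction="batchmean") * (t * t)

loss = distill_loss(torch.randn(2, 8, 32), torch.randn(2, 8, 32))
```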
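SpecDec's drafter/verifier split follows the standard draft-then-verify loop. Below is the strict greedy form of the verification step, for illustration only; SpecDec's Spec-Verification relaxes this acceptance criterion, and sampling-based variants use rejection sampling instead.

```python
import torch

def verify_greedy(draft_tokens: torch.Tensor, target_logits: torch.Tensor) -> int:
    """draft_tokens: (draft_len,) token ids proposed by the draft model.
    target_logits: (draft_len, vocab_size) target-model logits at those
    positions, computed in one parallel forward pass over the drafted block.
    Returns how many drafted tokens to accept: the longest prefix the target
    model would also have produced under greedy decoding."""
    matches = (draft_tokens == target_logits.argmax(dim=-1)).long()
    return int(matches.cumprod(dim=0).sum().item())

# Toy usage: the target agrees on the first two of four drafted tokens.
logits = torch.full((4, 10), -1e9)
logits[0, 3] = logits[1, 7] = logits[2, 1] = logits[3, 2] = 0.0
assert verify_greedy(torch.tensor([3, 7, 5, 2]), logits) == 2
```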