Related papers: Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

URL: http://arxiv.org/abs/2309.08168v2
Date: Mon, 20 May 2024 02:37:20 GMT
Title: Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra,
Abstract summary: We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models. The proposed method requires no additional neural network training and no extra memory footprint. Benchmarks with LLaMA-2 and its variants demonstrated a speedup up to 1.99$times$.
Score: 25.03122689338891
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The drafting stage generates draft tokens at a slightly lower quality but more quickly, which is achieved by selectively skipping certain intermediate layers during drafting. Subsequently, the verification stage employs the original LLM to validate those draft output tokens in one forward pass. This process ensures the final output remains identical to that produced by the unaltered LLM. Moreover, the proposed method requires no additional neural network training and no extra memory footprint, making it a plug-and-play and cost-effective solution for inference acceleration. Benchmarks with LLaMA-2 and its variants demonstrated a speedup up to 1.99$\times$.

Related papers

DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding [7.204881999658682]
We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. Del achieves overall speedups of $2.16times$$sim$$2.50times$ over vanilla auto-regressive decoding.
arXiv Detail & Related papers (2025-04-08T01:12:59Z)
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts [79.62448915248926]
Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing accuracy over the 16-bit model inference. We propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline.
arXiv Detail & Related papers (2025-03-17T08:38:45Z)
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling [59.8051705468084]
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models. We present FR-Spec, a frequency-ranked speculative sampling framework that optimize draft candidate selection through vocabulary space compression.
arXiv Detail & Related papers (2025-02-20T18:58:10Z)
Real-time Verification and Refinement of Language Model Text Generation [60.04718679054704]
Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks. A critical challenge remains in that they sometimes generate factually incorrect answers. We propose Streaming-VR, a novel approach designed to enhance the efficiency of verification and refinement of LLM outputs.
arXiv Detail & Related papers (2025-01-14T03:59:48Z)
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference. We show that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation [8.046705062670096]
Lossless speculative decoding accelerates target large language model inference. We propose FSPAD (Feature Sampling and Partial Alignment Distillation for Lossless Speculative Decoding) to boost speculative decoding. Our experiments include both greedy and non-greedy decoding on the largest and smallest models from the Vicuna and LLaMA3-Instruct series.
arXiv Detail & Related papers (2024-08-28T06:28:01Z)
PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL) PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speed up performance up to 4.43$times$ and 1.50$times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
arXiv Detail & Related papers (2024-08-13T08:32:06Z)
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [59.17158389902231]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models. We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. We observe a remarkable speedup of 1.73$times$ to 1.96$times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism [35.7077090639665]
We propose a novel approach called Early-exiting Speculative Decoding (EESD) with lossless acceleration. EESD utilizes a segment of the large language models (LLMs) to generate draft tokens, incorporating Early-exiting structures after the first N layers. We show that our approach decodes tokens at a markedly accelerated rate compared to prior methods.
arXiv Detail & Related papers (2024-06-06T08:40:28Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation. To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed. We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words. We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models. We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process. Ouroboros can achieve speedups of up to $2.8times$ over speculative decoding and $3.9times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding. GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM. We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.