ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
- URL: http://arxiv.org/abs/2503.13565v1
- Date: Mon, 17 Mar 2025 08:38:45 GMT
- Title: ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
- Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Kozlov, Alexander Heinecke,
- Abstract summary: Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing accuracy over the 16-bit model inference.<n>We propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4.<n>In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline.
- Score: 79.62448915248926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing any accuracy over the 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as "draft" to generate the next few tokens and use the "target" large model to verify the draft-generated tokens. The efficacy of this method heavily relies on the acceptance ratio of the draft-generated tokens and the relative token throughput of the draft versus the target model. Nevertheless, an efficient SD pipeline requires pre-training and aligning the draft model to the target model, making it impractical for LLM inference in a plug-and-play fashion. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline. Then we pursue an opportunity for further acceleration: the MXFP4 draft token generation itself can be accelerated via speculative decoding by using yet another smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts since it recursively applies speculation for accelerating the draft-token generation. Combining Multi-Level Speculative Decoding with MXFP4 Quantized Drafts we outperform state-of-the-art speculative decoding, yielding speedups up to 2.72x over the BF16 baseline.
Related papers
- PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation [4.031603850949324]
We introduce a novel speculative decoding method that enables low-cost adaptation of autoregressive draft models into parallel draft models.
Our proposed conditional drop token method can improve draft model training efficiency by 3x.
On our optimized inference framework, PARD accelerates LLaMA3.1-8B inference by 4.08x, achieving 311.5 tokens per second.
arXiv Detail & Related papers (2025-04-23T12:27:43Z) - The Power of Negative Zero: Datatype Customization for Quantized Large Language Models [5.503925076208333]
Post-training quantization serves as one of the most hardware-efficient methods to mitigate the memory and computational demands of large language models (LLMs)<n>In this paper, we extend the basic FP datatype to perform Redundant Zero Remapping (RaZeR)<n>RaZeR remaps the negative zero FP encoding to a set of pre-defined special values to maximally utilize FP quantization encodings and to better fit numerical distributions.
arXiv Detail & Related papers (2025-01-06T22:40:40Z) - PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL)<n>PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase.<n> Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speed up performance up to 4.43$times$ and 1.50$times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
arXiv Detail & Related papers (2024-08-13T08:32:06Z) - Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$times$ to 1.96$times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z) - Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to $2.8times$ over speculative decoding and $3.9times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z) - GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative
Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z) - Decoding Speculative Decoding [4.56754610152086]
Speculative Decoding is a technique to speed up inference for Large Language Models without sacrificing quality.<n>We study over 350 experiments with LLaMA-65B and OPT-66B using speculative decoding.<n>Our newly designed draft model can provide 111% higher throughput than existing draft models.
arXiv Detail & Related papers (2024-02-02T16:15:24Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language
Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z) - Online Speculative Decoding [34.987825705622555]
We introduce online speculative decoding to accelerate the inference of large language models.
The main idea is to continuously update the (multiple) draft model(s) on observed user query data.
We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data.
arXiv Detail & Related papers (2023-10-11T04:03:42Z) - Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding [25.03122689338891]
We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models.
The proposed method requires no additional neural network training and no extra memory footprint.
Benchmarks with LLaMA-2 and its variants demonstrated a speedup up to 1.99$times$.
arXiv Detail & Related papers (2023-09-15T05:34:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.