Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
- URL: http://arxiv.org/abs/2505.22179v2
- Date: Thu, 29 May 2025 04:07:33 GMT
- Title: Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
- Authors: Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu,
- Abstract summary: Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Quantization achieves this by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. Experiments show that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding.
- Score: 34.04231165571518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.
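Below is a minimal, self-contained sketch of the hierarchical idea described in the abstract, using toy stand-in models over a tiny vocabulary: a draft model proposes a tree of candidate tokens, a small intermediate model greedily flattens the tree into a single sequence draft, and the (conceptually 4-bit weight-quantized) target model verifies that sequence in one linear pass. All model functions, scores, and acceptance probabilities here are illustrative assumptions, not the paper's actual method; see the linked SpecMQuant repository for the real implementation.

```python
# Hypothetical sketch of hierarchical speculative decoding:
# draft tree -> intermediate model flattens to a sequence draft -> target verifies.
import random

VOCAB = list(range(8))  # toy vocabulary of 8 token ids


def draft_tree(prefix, depth=3, branching=2):
    """Toy draft model: propose a tree of candidate tokens below the prefix."""
    if depth == 0:
        return []
    children = random.sample(VOCAB, branching)
    return [(tok, draft_tree(prefix + [tok], depth - 1, branching)) for tok in children]


def intermediate_score(prefix, token):
    """Toy intermediate model: score a candidate token given the prefix."""
    return -abs(token - (sum(prefix) % len(VOCAB)))  # arbitrary deterministic score


def flatten_tree_to_sequence(prefix, tree):
    """Greedily walk the tree with the intermediate model -> sequence draft."""
    seq, node = [], tree
    while node:
        token, subtree = max(node, key=lambda c: intermediate_score(prefix + seq, c[0]))
        seq.append(token)
        node = subtree
    return seq


def target_accepts(prefix, token):
    """Toy quantized target model: accept a drafted token with some probability."""
    return random.random() < 0.7


def hierarchical_step(prefix):
    """One decode step: draft a tree, flatten it, verify the sequence draft."""
    tree = draft_tree(prefix)
    seq_draft = flatten_tree_to_sequence(prefix, tree)
    accepted = []
    for tok in seq_draft:  # single linear verification pass over the sequence draft
        if not target_accepts(prefix + accepted, tok):
            break
        accepted.append(tok)
    return accepted or [random.choice(VOCAB)]  # fall back to one target-sampled token


if __name__ == "__main__":
    random.seed(0)
    context = [1, 2, 3]
    print("accepted tokens:", hierarchical_step(context))
```

The point of the intermediate stage in this sketch is that the expensive target model only ever sees a linear sequence of drafted tokens rather than a full tree, which is the memory-access-friendly pattern the abstract attributes to the hierarchical framework.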
Related papers
- CCQ: Convolutional Code for Extreme Low-bit Quantization in LLMs [25.32003624625106]
Convolutional Code Quantization (CCQ) is an inference-optimized quantization approach compressing Large Language Models to 2.0-2.75 bits with minimal accuracy loss.
We construct a lookup-free encoding space, enabling a linear mapping between the codebook and weight.
Experiments demonstrate that CCQ achieves outstanding performance on LLMs across various benchmarks.
arXiv Detail & Related papers (2025-07-09T06:04:14Z) - Capturing the Effects of Quantization on Trojans in Code LLMs [12.814581766967047]
We investigate the impact of quantization on the risk of data poisoning attacks on large language models of code.
We find that quantization has differing effects on code-generating LLMs.
We introduce a new metric for measuring trojan signals in compromised models.
arXiv Detail & Related papers (2025-05-20T11:01:14Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information of even 2x the model size in order to achieve optimal convergence.
SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits.
SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - An Efficient Quantum Classifier Based on Hamiltonian Representations [50.467930253994155]
Quantum machine learning (QML) is a discipline that seeks to transfer the advantages of quantum computing to data-driven tasks.
We propose an efficient approach that circumvents the costs associated with data encoding by mapping inputs to a finite set of Pauli strings.
We evaluate our approach on text and image classification tasks, against well-established classical and quantum models.
arXiv Detail & Related papers (2025-04-13T11:49:53Z) - SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models can generate high-quality images, but as they scale, rising memory demands and higher latency pose deployment challenges.
We propose SVDQuant, a new 4-bit quantization paradigm to overcome this limitation.
We reduce the memory usage for the 12B FLUX.1 models by 3.5$\times$, achieving 3.0$\times$ speedup over the 4-bit weight-only quantization (W4A16) baseline.
arXiv Detail & Related papers (2024-11-07T18:59:58Z) - GPTQT: Quantize Large Language Models Twice to Push the Efficiency [1.3149617027696827]
This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed.
Practice has shown that minimizing the quantization error of weights is ineffective, leading to overfitting.
GPTQT employs a progressive two-step approach: initially quantizing weights to a relatively high bit-width using linear quantization, followed by converting the obtained integer weights to lower-bit binary coding.
arXiv Detail & Related papers (2024-07-03T08:08:01Z) - FrameQuant: Flexible Low-Bit Quantization for Transformers [25.569106620123346]
Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks.
Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower.
We show, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains.
arXiv Detail & Related papers (2024-03-10T04:01:49Z) - Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z) - OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [57.27101446992148]
Large language models (LLMs) have revolutionized natural language processing tasks.
Recent post-training quantization (PTQ) methods are effective in reducing memory footprint and improving the computational efficiency of LLMs.
We introduce an Omnidirectionally calibrated Quantization technique for LLMs, which achieves good performance in diverse quantization settings.
arXiv Detail & Related papers (2023-08-25T02:28:35Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format (a toy sketch of this decomposition idea follows after this list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
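The SqueezeLLM entry above mentions a Dense-and-Sparse decomposition; the following is a small illustrative sketch of that idea under simplifying assumptions: weights above a magnitude threshold are kept exactly in a sparse matrix, while the remaining dense part is quantized with a plain uniform quantizer. The function name, outlier percentage, bit-width, and uniform quantizer here are arbitrary demonstration choices, not the paper's method, which uses sensitivity-based non-uniform quantization and an efficient sparse storage format.

```python
# Illustrative dense + sparse split of a weight matrix (toy uniform quantizer).
import numpy as np


def dense_sparse_split(w, outlier_pct=0.5, bits=3):
    """Split w into a low-bit dense part plus a full-precision sparse outlier part."""
    threshold = np.percentile(np.abs(w), 100.0 - outlier_pct)
    outlier_mask = np.abs(w) > threshold

    sparse = np.where(outlier_mask, w, 0.0)   # outliers kept exactly (sparse)
    dense = np.where(outlier_mask, 0.0, w)    # remainder to be quantized (dense)

    # Simple symmetric uniform quantizer for the dense part; a stand-in for
    # SqueezeLLM's sensitivity-based non-uniform codebook.
    half_range = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(dense).max()), 1e-12) / half_range
    q = np.clip(np.round(dense / scale), -half_range, half_range)
    dense_dequant = q * scale

    return dense_dequant, sparse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 8))
    w[0, 0] = 6.0  # inject an outlier weight
    dense_q, sparse = dense_sparse_split(w)
    print("max reconstruction error:", np.abs((dense_q + sparse) - w).max())
```

Keeping the few outliers in full precision lets the dense part use a much tighter quantization range, which is the intuition the SqueezeLLM summary points to.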