Related papers: QSpec: Speculative Decoding with Complementary Quantization Schemes

QSpec: Speculative Decoding with Complementary Quantization Schemes

URL: http://arxiv.org/abs/2410.11305v1
Date: Tue, 15 Oct 2024 05:57:51 GMT
Title: QSpec: Speculative Decoding with Complementary Quantization Schemes
Authors: Juntao Zhao, Wenhao Lu, Sheng Wang, Lingpeng Kong, Chuan Wu,
Abstract summary: Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models. We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding. QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise.
Score: 37.007621357142725
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models (LLMs). While activation-weight joint quantization speeds up the inference process through low-precision kernels, we demonstrate that it suffers severe performance degradation on multi-step reasoning tasks, rendering it ineffective. We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding. Leveraging nearly cost-free execution switching, QSPEC drafts tokens with low-precision, fast activation-weight quantization, and verifies them with high-precision weight-only quantization, effectively combining the strengths of both quantization schemes. Compared to high-precision quantization methods, QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise, distinguishing it from other low-precision quantization approaches. This enhancement is also consistent across various serving tasks, model sizes, quantization methods, and batch sizes. Unlike existing speculative decoding techniques, our approach reuses weights and the KV cache, avoiding additional memory overhead. Furthermore, QSPEC offers a plug-and-play advantage without requiring any training. We believe that QSPEC demonstrates unique strengths for future deployment of high-fidelity quantization schemes, particularly in memory-constrained scenarios (e.g., edge devices).

Related papers

MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved textbfMixed textbfPrecision textbfQuantization framework for extremely low-bit textbfDiffusion textbfModels.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
Channel-Wise Mixed-Precision Quantization for Large Language Models [47.00361921910259]
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. We introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method.
arXiv Detail & Related papers (2024-10-16T21:34:41Z)
AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization [5.572159724234467]
Mixed-precision quantization distinguishes between important and unimportant parameters. Existing approaches can only identify important parameters through qualitative analysis and manual experiments. We propose a new criterion, so-called 'precision alignment', to build a quantitative framework to holistically evaluate the importance of parameters.
arXiv Detail & Related papers (2024-09-25T01:39:02Z)
PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
Non-quantized elementwise operations dominate the inference cost of low-precision models. PikeLPN model addresses these issues by applying quantization to both elementwise operations and multiply-accumulate operations.
arXiv Detail & Related papers (2024-03-29T18:23:34Z)
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge [40.85258685379659]
Post-Training Quantization (PTQ) methods degrade in quality when quantizing weights, activations, and KV cache together to below 8 bits. Many Quantization-Aware Training (QAT) works quantize model weights, leaving the activations untouched, which do not fully exploit the potential of quantization for inference acceleration on the edge. We propose EdgeQAT, the Entropy and Distribution Guided QAT for the optimization of lightweight LLMs to achieve inference acceleration on Edge devices.
arXiv Detail & Related papers (2024-02-16T16:10:38Z)
RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization [8.827794405944637]
Post-training quantization (PTQ) is a promising solution for compressing large transformer models. Existing PTQ methods typically exhibit non-trivial performance loss. We propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm.
arXiv Detail & Related papers (2024-02-08T12:35:41Z)
On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices. For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator. For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel quantize before fine-tuning'' framework, PreQuant. PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error. We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z)
CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level Continuous Sparsification [51.81850995661478]
Mixed-precision quantization has been widely applied on deep neural networks (DNNs) Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence. We propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability.
arXiv Detail & Related papers (2022-12-06T05:44:21Z)
Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks. DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons. We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy. We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR. Our FQSR using low bits quantization can achieve on par performance compared with the full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.