AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization
- URL: http://arxiv.org/abs/2409.16546v2
- Date: Mon, 21 Oct 2024 05:06:01 GMT
- Title: AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization
- Authors: Yifan Tan, Haoze Wang, Chao Yan, Yangdong Deng
- Abstract summary: Mixed-precision quantization distinguishes between important and unimportant parameters.
Existing approaches can only identify important parameters through qualitative analysis and manual experiments.
We propose a new criterion, termed 'precision alignment', to build a quantitative framework that holistically evaluates the importance of parameters.
- Score: 5.572159724234467
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Model quantization has become a crucial technique for addressing the large memory consumption and long inference times of LLMs. Mixed-precision quantization, which distinguishes between important and unimportant parameters, stands out among the many quantization schemes because it balances precision and compression rate. However, existing approaches can only identify important parameters through qualitative analysis and manual experiments, without quantitatively analyzing how their importance is determined. We propose a new criterion, termed 'precision alignment', to build a quantitative framework for holistically evaluating the importance of parameters in mixed-precision quantization. Our observations on floating-point addition under various real-world scenarios suggest that two addends should have identical precision; otherwise, the information in the higher-precision number is wasted. This observation offers an essential principle for determining the precision of each parameter in matrix multiplication operations. As a first step toward applying this discovery to large-model inference, we develop a dynamic KV-Cache quantization technique that effectively reduces memory access latency. Unlike existing quantization approaches that focus on memory saving, this work directly aims to accelerate LLM inference by quantizing floating-point numbers. The proposed technique attains a 25% saving in memory access and delivers up to a 1.3x speedup in attention computation during the decoding phase of LLM inference, with almost no loss of precision.
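To make the precision-alignment observation concrete, here is a minimal NumPy sketch (an illustration under assumed settings, not the paper's implementation): one addend is rounded to roughly 7 mantissa bits, and the summation error is compared when the other addend is kept at full precision versus aligned to the same reduced precision. The helper truncate_mantissa and the chosen bit-width are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncate_mantissa(x, bits):
    """Round float32 values to roughly `bits` mantissa bits (illustrative helper)."""
    m, e = np.frexp(x.astype(np.float32))      # x = m * 2**e with m in [0.5, 1)
    scale = 2.0 ** bits
    return np.ldexp(np.round(m * scale) / scale, e)

a = rng.standard_normal(10_000).astype(np.float32)
b = rng.standard_normal(10_000).astype(np.float32)

b_low = truncate_mantissa(b, 7)                # b quantized to ~7 mantissa bits
a_low = truncate_mantissa(a, 7)                # a aligned to the same precision

ref = a.astype(np.float64) + b.astype(np.float64)
err_mixed   = np.abs((a     + b_low) - ref).mean()   # high-precision a + low-precision b
err_aligned = np.abs((a_low + b_low) - ref).mean()   # both addends at the same precision

# The two errors are of the same order of magnitude: once b is low precision,
# keeping a at full precision buys almost nothing, i.e. its extra bits are wasted.
print(err_mixed, err_aligned)
```

Under this view, storing the KV cache at a precision matched to the other operand in attention can cut memory traffic with little additional error, which is the intuition the abstract builds on.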
Related papers
- Channel-Wise Mixed-Precision Quantization for Large Language Models [47.00361921910259]
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks.
Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs.
We introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method.
arXiv Detail & Related papers (2024-10-16T21:34:41Z)
- QSpec: Speculative Decoding with Complementary Quantization Schemes [37.007621357142725]
Quantization has been widely adopted to accelerate inference and reduce the memory consumption of large language models.
We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding.
QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
- Scaling Laws for Mixed quantization in Large Language Models [10.912306313183972]
Post-training quantization of Large Language Models (LLMs) has proven effective in reducing the computational requirements for running inference on these models.
In this study, we focus on a straightforward question: when aiming for a specific accuracy or perplexity target under low-precision quantization, how many high-precision numbers or calculations need to be preserved as we scale LLMs to larger sizes?
arXiv Detail & Related papers (2024-10-09T09:45:01Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique widely investigated for large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4 bits.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- Accurate Block Quantization in LLMs with Outliers [0.6138671548064355]
The demand for inference on extremely large-scale LLMs has seen enormous growth in recent months.
The problem is aggravated by the explosive increase in the lengths of the sequences being processed.
Various quantization techniques have been proposed that allow accurate quantization for both weights and activations.
arXiv Detail & Related papers (2024-03-29T12:15:06Z)
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers (a minimal quantize/dequantize sketch illustrating this idea appears after this list).
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
- DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z)
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose a Mask-Guided Quantization Estimation technique to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR with low-bit quantization achieves performance on par with full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)
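As noted in the WKVQuant entry above, low-bit quantization stores tensors as integers plus a scale and dequantizes them on the fly. The sketch below is a generic symmetric int8 quantize/dequantize routine applied to a mock KV-cache tensor; the shapes, function names, and per-channel scheme are assumptions for illustration and do not reproduce the method of any specific paper listed here.

```python
import numpy as np

def quantize_int8(x, axis=-1):
    """Symmetric per-channel int8 quantization: x ~= scale * q."""
    amax = np.abs(x).max(axis=axis, keepdims=True) + 1e-8
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int8(q, scale):
    """Recover an approximate float32 tensor from int8 values and scales."""
    return q.astype(np.float32) * scale

# Mock (heads, seq_len, head_dim) slice of a KV cache
kv = np.random.randn(8, 128, 64).astype(np.float32)
q, scale = quantize_int8(kv)            # stored as int8: ~4x less memory to move
kv_hat = dequantize_int8(q, scale)      # dequantized just before the attention matmul
print(np.abs(kv - kv_hat).max())        # small, bounded reconstruction error
```

Storing the cache as int8 cuts memory traffic by roughly 4x relative to float32, at the cost of the small per-channel rounding error shown above.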