Related papers: AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations

URL: http://arxiv.org/abs/2410.13212v1
Date: Thu, 17 Oct 2024 04:35:57 GMT
Title: AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
Authors: Qian Tao, Wenyuan Yu, Jingren Zhou,
Abstract summary: Large language models often require substantial storage space. Due to their massive parameter count, these models often require substantial storage space. One research direction proposes to compress the models using integer replacements for floating-point numbers.
Score: 36.63586957377984
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models have shown exceptional capabilities in a wide range of tasks, such as text generation and video generation, among others. However, due to their massive parameter count, these models often require substantial storage space, imposing significant constraints on the machines deploying LLMs. To overcome this limitation, one research direction proposes to compress the models using integer replacements for floating-point numbers, in a process known as Quantization. Some recent studies suggest quantizing the key and value cache (KV Cache) of LLMs, and designing quantization techniques that treat the key and value matrices equivalently. This work delves deeper into the asymmetric structural roles of KV Cache, a phenomenon where the transformer's output loss is more sensitive to the quantization of key matrices. We conduct a systematic examination of the attention output error resulting from key and value quantization. The phenomenon inspires us to propose an asymmetric quantization strategy. Our approach allows for 1-bit quantization of the KV cache by implementing distinct configurations for key and value matrices. We carry out experiments across a variety of datasets, demonstrating that our proposed model allows for the quantization of up to 75% decoder layers with 1 bit, while simultaneously maintaining performance levels comparable to those of the models with floating parameters.

Related papers

Single-Qudit Quantum Neural Networks for Multiclass Classification [0.0]
This paper proposes a single-qudit quantum neural network for multiclass classification. Our design employs an $d$-dimensional unitary operator, where $d$ corresponds to the number of classes. We evaluate our model on the MNIST and EMNIST datasets, demonstrating competitive accuracy.
arXiv Detail & Related papers (2025-03-12T11:12:05Z)
More for Keys, Less for Values: Adaptive KV Cache Quantization [59.708443710731146]
This paper introduces an information-aware quantization framework that adaptively compresses the key-value cache in large language models. We show that key matrices consistently exhibit higher norm values and are more sensitive to quantization than value matrices. We propose a mixed-precision quantization strategy, KV-AdaQuant, which allocates more bitwidth for keys and fewer for values.
arXiv Detail & Related papers (2025-02-20T22:24:27Z)
Residual vector quantization for KV cache compression in large language model [2.3094645821058735]
KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding. In this work, we apply residual vector quantization, which has been widely used for high fidelity audio compression, to compress KV cache in large language models (LLM) We learn the codebook using exponential moving average and there are no other learnable parameters including the input and output projections normally used in a vector quantization set up.
arXiv Detail & Related papers (2024-10-21T07:20:41Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
Enhancing the performance of Variational Quantum Classifiers with hybrid autoencoders [0.0]
We propose an alternative method which reduces the dimensionality of a given dataset by taking into account the specific quantum embedding that comes after. This method aspires to make quantum machine learning with VQCs more versatile and effective on datasets of high dimension.
arXiv Detail & Related papers (2024-09-05T08:51:20Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models. We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization [8.827794405944637]
Post-training quantization (PTQ) is a promising solution for compressing large transformer models. Existing PTQ methods typically exhibit non-trivial performance loss. We propose RepQuant, a novel PTQ framework with quantization-inference decoupling paradigm.
arXiv Detail & Related papers (2024-02-08T12:35:41Z)
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z)
Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers. We can obtain an 81.29% top-1 accuracy using DeiT-B model on ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.