QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering
- URL: http://arxiv.org/abs/2407.03637v4
- Date: Fri, 6 Sep 2024 08:28:01 GMT
- Title: QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering
- Authors: Yanshu Wang, Wang Li, Zhaoqian Yao, Tong Yang
- Abstract summary: We formulate the Quantization Error Minimization problem as minimizing the distance between a matrix before and after quantization.
Matrix quantization is crucial in various applications, including Large Language Models (LLMs) weight quantization, vector databases, KV cache quantization, graph compression, and image compression.
We propose Quantum Entanglement Trees (QET) to address the QEM problem by leveraging the local orderliness of matrix elements.
- Score: 5.363038867793461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matrix quantization entails representing matrix elements in a more space-efficient form to reduce storage usage, with dequantization restoring the original matrix for use. We formulate the Quantization Error Minimization (QEM) problem as minimizing the distance between a matrix before and after quantization, under the condition that the quantized matrix occupies the same memory space. Matrix quantization is crucial in various applications, including Large Language Models (LLMs) weight quantization, vector databases, KV cache quantization, graph compression, and image compression. Recent advancements in LLMs, such as GPT-4 and BERT, have highlighted the importance of matrix compression due to the large size of parameters and KV cache, which are stored as matrices. We propose Quantum Entanglement Trees (QET) to address the QEM problem by leveraging the local orderliness of matrix elements, involving iterative element swapping to form a locally ordered matrix. This matrix is then grouped and quantized by columns. To enhance QET, we introduce two optimizations: further quantizing residuals to reduce MSE, and using masking and batch processing to accelerate the algorithm. Experimental results demonstrate that QET can effectively reduce MSE to 5.05%, 13.33%, and 11.89% of the current best method on the LLM dataset, K cache, and V cache, respectively. Our contributions include the abstraction of the QEM problem, the design of the QET algorithm, and the proposal of two optimizations to improve accuracy and speed.
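The abstract outlines the pipeline at a high level; the toy NumPy sketch below illustrates that flow (local element swapping, column-wise quantization, then residual quantization). The pairwise swap rule, group size, and bit-widths are placeholders chosen for illustration, not the paper's exact algorithm.

```python
import numpy as np

def uniform_quantize(x, bits=4):
    """Uniform min-max quantization of an array; returns the dequantized values."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.full_like(x, lo)
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

def qet_sketch(M, bits=4, residual_bits=2):
    """Illustrative QET-style pipeline: local reordering, column-wise
    quantization, then quantization of the residual."""
    # 1) Local ordering: swap adjacent elements in each row so that
    #    every consecutive pair is sorted (one pass of the swapping idea).
    ordered = M.copy()
    swap = ordered[:, 0::2] > ordered[:, 1::2]            # which pairs to swap
    a, b = ordered[:, 0::2].copy(), ordered[:, 1::2].copy()
    ordered[:, 0::2] = np.where(swap, b, a)
    ordered[:, 1::2] = np.where(swap, a, b)

    # 2) Quantize column by column (each column is one group here).
    deq = np.stack([uniform_quantize(col, bits) for col in ordered.T], axis=1)

    # 3) Quantize the residual to further reduce MSE.
    residual = ordered - deq
    deq = deq + uniform_quantize(residual, residual_bits)

    # 4) Undo the swaps so the reconstruction matches the original layout.
    recon = deq.copy()
    ra, rb = recon[:, 0::2].copy(), recon[:, 1::2].copy()
    recon[:, 0::2] = np.where(swap, rb, ra)
    recon[:, 1::2] = np.where(swap, ra, rb)
    return recon, swap

M = np.random.randn(8, 16).astype(np.float32)
recon, _ = qet_sketch(M)
print("MSE:", float(np.mean((M - recon) ** 2)))
```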
Related papers
- Reducing QUBO Density by Factoring Out Semi-Symmetries [4.581191399651181]
We introduce the concept of semi-symmetries in QUBO matrices.
We show that our algorithm reduces the number of couplings and circuit depth by up to 45%.
arXiv Detail & Related papers (2024-12-18T12:05:18Z) - Memory-Efficient 4-bit Preconditioned Stochastic Optimization [53.422307389223626]
We introduce 4-bit quantization for Shampoo's preconditioners.
To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners.
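As a rough illustration of the idea (not the paper's optimizer code), the sketch below builds a toy symmetric positive-definite preconditioner, takes its Cholesky factor, and quantizes that factor to 4 bits with a simple absmax quantizer; the matrix sizes and quantizer are assumptions.

```python
import numpy as np

def quantize_4bit(x):
    """Illustrative symmetric 4-bit uniform quantizer (absmax scaling)."""
    scale = np.abs(x).max() / 7.0               # 4-bit signed range: -8..7
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

# Toy "preconditioner": a symmetric positive-definite statistics matrix.
G = np.random.randn(64, 16)
P = G.T @ G + 1e-3 * np.eye(16)

# Quantize the Cholesky factor instead of the dense preconditioner.
L = np.linalg.cholesky(P)
q, scale = quantize_4bit(L)
L_hat = q.astype(np.float64) * scale
P_hat = L_hat @ L_hat.T                         # reconstructed preconditioner stays PSD

print("relative error:", np.linalg.norm(P - P_hat) / np.linalg.norm(P))
```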
arXiv Detail & Related papers (2024-12-14T03:32:54Z) - MVQ:Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization [8.057807176915896]
A novel approach called MVQ is proposed, which aims at better approximating important weights with a limited number of codewords.
Our algorithm is validated on various models for image classification, object detection, and segmentation tasks.
Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by 2.3x and reduces the size of the systolic array by 55% when compared with the base EWS accelerator.
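A minimal sketch of the general idea, assuming a weighted k-means as the fitting procedure and a random saliency map as a stand-in importance score; MVQ's actual masking and codebook construction differ.

```python
import numpy as np

def masked_vq(W, importance, k=16, dim=4, iters=10):
    """Illustrative importance-weighted vector quantization: weights are split
    into sub-vectors, and codewords are fit with a weighted k-means so that
    important weights are approximated more faithfully."""
    vecs = W.reshape(-1, dim)
    w = importance.reshape(-1, dim).mean(axis=1) + 1e-8   # one weight per sub-vector
    rng = np.random.default_rng(0)
    codebook = vecs[rng.choice(len(vecs), k, replace=False)]
    for _ in range(iters):
        d = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for c in range(k):
            m = assign == c
            if m.any():
                # Weighted centroid: important sub-vectors pull codewords toward them.
                codebook[c] = (vecs[m] * w[m, None]).sum(0) / w[m].sum()
    return codebook[assign].reshape(W.shape)

W = np.random.randn(128, 64).astype(np.float32)
importance = np.abs(np.random.randn(*W.shape))            # stand-in saliency scores
W_hat = masked_vq(W, importance)
print("weighted MSE:", float((importance * (W - W_hat) ** 2).mean()))
```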
arXiv Detail & Related papers (2024-12-13T16:30:35Z) - Residual vector quantization for KV cache compression in large language model [2.3094645821058735]
KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding.
In this work, we apply residual vector quantization, which has been widely used for high-fidelity audio compression, to compress the KV cache in large language models (LLMs).
We learn the codebook using an exponential moving average; there are no other learnable parameters, including the input and output projections normally used in a vector quantization setup.
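The following toy sketch shows the general shape of such a residual vector quantizer with EMA codebook updates; stage count, codebook size, and decay are illustrative values, not the paper's configuration.

```python
import numpy as np

class ResidualVQ:
    """Toy multi-stage residual vector quantizer with EMA codebook updates
    and no other learnable parameters (a sketch, not the paper's code)."""
    def __init__(self, stages=2, k=64, dim=8, decay=0.99, seed=0):
        rng = np.random.default_rng(seed)
        self.codebooks = [rng.standard_normal((k, dim)) for _ in range(stages)]
        self.ema_sum = [cb.copy() for cb in self.codebooks]
        self.ema_cnt = [np.ones(k) for _ in self.codebooks]
        self.decay = decay

    def encode(self, x, update=True):
        residual, recon = x.copy(), np.zeros_like(x)
        for s, cb in enumerate(self.codebooks):
            d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
            idx = d.argmin(axis=1)
            chosen = cb[idx]
            if update:  # EMA update of per-codeword sums and counts
                onehot = np.eye(len(cb))[idx]
                self.ema_cnt[s] = self.decay * self.ema_cnt[s] + (1 - self.decay) * onehot.sum(0)
                self.ema_sum[s] = self.decay * self.ema_sum[s] + (1 - self.decay) * onehot.T @ residual
                self.codebooks[s] = self.ema_sum[s] / self.ema_cnt[s][:, None]
            recon += chosen
            residual = residual - chosen      # next stage quantizes what is left
        return recon

# Toy KV-cache slice: 256 vectors of dimension 8.
kv = np.random.randn(256, 8)
rvq = ResidualVQ()
print("MSE:", float(((kv - rvq.encode(kv)) ** 2).mean()))
```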
arXiv Detail & Related papers (2024-10-21T07:20:41Z) - AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations [36.63586957377984]
Due to their massive parameter count, large language models often require substantial storage space.
One research direction proposes to compress the models by replacing floating-point numbers with integers.
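A hypothetical sketch of what a layer-wise asymmetric configuration can look like: keys and values in the same layer get different bit-widths, with some layers pushed down to 1 bit. The specific configuration and quantizer below are assumptions, not AsymKV's.

```python
import numpy as np

def quantize(x, bits):
    """Per-tensor uniform quantizer used for both keys and values (toy version)."""
    if bits == 1:
        return np.sign(x) * np.abs(x).mean()          # 1-bit: sign plus a single scale
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

# Hypothetical layer-wise asymmetric configuration: early layers keep more
# precision for keys, later layers drop keys to 1 bit, values stay at 2 bits.
num_layers = 4
config = [{"k_bits": 4 if layer < 2 else 1, "v_bits": 2} for layer in range(num_layers)]

cache = [{"K": np.random.randn(128, 64), "V": np.random.randn(128, 64)}
         for _ in range(num_layers)]
for layer, (cfg, kv) in enumerate(zip(config, cache)):
    k_hat, v_hat = quantize(kv["K"], cfg["k_bits"]), quantize(kv["V"], cfg["v_bits"])
    print(layer, cfg,
          "K MSE:", float(((kv["K"] - k_hat) ** 2).mean()),
          "V MSE:", float(((kv["V"] - v_hat) ** 2).mean()))
```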
arXiv Detail & Related papers (2024-10-17T04:35:57Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
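As a plain NumPy stand-in for what such a kernel computes (FLUTE itself fuses these steps on the GPU), the sketch below dequantizes 4-bit codes through a 16-entry lookup table, applies per-group scales, and runs an ordinary GEMM; the table values, shapes, and layout are placeholders, not FLUTE's format.

```python
import numpy as np

rng = np.random.default_rng(0)
bits, group = 4, 128                                             # 4-bit codes, group size 128
lut = np.linspace(-1.0, 1.0, 2 ** bits).astype(np.float32)      # 16-entry value table

codes = rng.integers(0, 2 ** bits, size=(1024, 1024), dtype=np.uint8)   # quantized weight codes
scales = rng.random((1024, 1024 // group)).astype(np.float32) + 0.5     # per-group scales

def lut_dequant_matmul(x, codes, lut, scales, group):
    # Dequantize by table lookup, rescale per group, then run an ordinary GEMM.
    w = lut[codes] * np.repeat(scales, group, axis=1)
    return x @ w.T

x = rng.standard_normal((32, 1024)).astype(np.float32)           # batch size 32, as in the blurb
print(lut_dequant_matmul(x, codes, lut, scales, group).shape)
```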
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique that has been widely investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
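A toy sketch of the salience-driven idea, assuming per-group bit allocation where the most salient groups get an extra bit and the least salient give one up; the salience metric and allocation rule here are placeholders, not SliM-LLM's.

```python
import numpy as np

def quantize_group(x, bits):
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / max(hi - lo, 1e-8) * levels) / levels * (hi - lo) + lo

def salience_mixed_precision(W, salience, group=64, budget_bits=3):
    """Toy salience-driven bit allocation: salient groups get more bits,
    less salient groups fewer, keeping the average at the budget."""
    groups = W.reshape(-1, group)
    s = salience.reshape(-1, group).mean(axis=1)
    order = np.argsort(-s)                                 # most salient groups first
    bits = np.full(len(groups), budget_bits)
    top, bottom = order[: len(order) // 4], order[-(len(order) // 4):]
    bits[top], bits[bottom] = budget_bits + 1, budget_bits - 1   # same average bit-width
    out = np.stack([quantize_group(g, b) for g, b in zip(groups, bits)])
    return out.reshape(W.shape)

W = np.random.randn(256, 256).astype(np.float32)
salience = np.abs(W)                                        # stand-in salience metric
W_hat = salience_mixed_precision(W, salience)
print("MSE:", float(((W - W_hat) ** 2).mean()))
```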
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
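The sketch below shows the generic low-bit integer (affine) quantization recipe this line refers to, applied to a toy key-cache tensor; it is not WKVQuant's specific scheme.

```python
import numpy as np

def affine_quantize(x, bits=4):
    """Toy per-tensor affine quantization to unsigned low-bit integers."""
    qmax = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(x / scale + zero_point), 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Quantize a toy key-cache tensor: (heads, sequence, head_dim).
K = np.random.randn(8, 512, 64).astype(np.float32)
q, scale, zp = affine_quantize(K, bits=4)
K_hat = affine_dequantize(q, scale, zp)
print("bytes (fp32 -> packed 4-bit):", K.nbytes, "->", q.size // 2)
print("MSE:", float(((K - K_hat) ** 2).mean()))
```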
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
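A hedged sketch of a 1-bit style factorization: a {-1, +1} sign matrix plus two full-precision value vectors fitted to |W| by alternating least squares. This approximates the sign-value decomposition idea but is not OneBit's training procedure.

```python
import numpy as np

def onebit_style_decompose(W, iters=20):
    """Sketch: W is approximated by a sign matrix rescaled by a rank-1
    outer product of two full-precision value vectors."""
    S = np.sign(W)
    S[S == 0] = 1.0
    A = np.abs(W)
    a = A.mean(axis=1)                          # initial row scales
    for _ in range(iters):                      # alternating least squares on |W| ~ a b^T
        b = (A * a[:, None]).sum(0) / (a ** 2).sum()
        a = (A * b[None, :]).sum(1) / (b ** 2).sum()
    return S, a, b

W = np.random.randn(512, 512).astype(np.float32)
S, a, b = onebit_style_decompose(W)
W_hat = S * np.outer(a, b)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```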
arXiv Detail & Related papers (2024-02-17T14:26:57Z) - LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
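The iterative decomposition can be sketched directly from this description: alternate between quantizing W minus the current low-rank part and refitting the low-rank part to the quantization residual via a truncated SVD. Bit-width, rank, and the quantizer below are illustrative, not LQ-LoRA's.

```python
import numpy as np

def quantize(x, bits=3):
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

def lowrank_plus_quant(W, rank=16, bits=3, iters=5):
    """Sketch of the iterative decomposition: quantize the residual, then
    refit a high-precision low-rank correction to what quantization missed."""
    L = np.zeros_like(W)
    for _ in range(iters):
        Q = quantize(W - L, bits)                      # memory-efficient quantized part
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]       # high-precision low-rank part
    return Q, L

W = np.random.randn(256, 256).astype(np.float32)
Q, L = lowrank_plus_quant(W)
print("relative error:", np.linalg.norm(W - (Q + L)) / np.linalg.norm(W))
```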
arXiv Detail & Related papers (2023-11-20T18:57:41Z) - A quantum algorithm for solving eigenproblem of the Laplacian matrix of a fully connected weighted graph [4.045204834863644]
We propose an efficient quantum algorithm to solve the eigenproblem of the Laplacian matrix of a fully connected weighted graph.
Specifically, we adopt the optimal Hamiltonian simulation technique based on the block-encoding framework.
We also show that our algorithm can be extended to solve the eigenproblem of the symmetric (and non-symmetric) normalized Laplacian matrix.
arXiv Detail & Related papers (2022-03-28T02:24:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.