QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering
- URL: http://arxiv.org/abs/2407.03637v4
- Date: Fri, 6 Sep 2024 08:28:01 GMT
- Title: QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering
- Authors: Yanshu Wang, Wang Li, Zhaoqian Yao, Tong Yang
- Abstract summary: We formulate the Quantization Error Minimization problem as minimizing the distance between a matrix before and after quantization.
Matrix quantization is crucial in various applications, including Large Language Models (LLMs) weight quantization, vector databases, KV cache quantization, graph compression, and image compression.
We propose Quantum Entanglement Trees (QET) to address the QEM problem by leveraging the local orderliness of matrix elements.
- Score: 5.363038867793461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Matrix quantization entails representing matrix elements in a more space-efficient form to reduce storage usage, with dequantization restoring the original matrix for use. We formulate the Quantization Error Minimization (QEM) problem as minimizing the distance between a matrix before and after quantization, under the condition that the quantized matrix occupies the same memory space. Matrix quantization is crucial in various applications, including Large Language Models (LLMs) weight quantization, vector databases, KV cache quantization, graph compression, and image compression. Recent advancements in LLMs, such as GPT-4 and BERT, have highlighted the importance of matrix compression due to the large size of parameters and KV cache, which are stored as matrices. We propose Quantum Entanglement Trees (QET) to address the QEM problem by leveraging the local orderliness of matrix elements, involving iterative element swapping to form a locally ordered matrix. This matrix is then grouped and quantized by columns. To enhance QET, we introduce two optimizations: further quantizing residuals to reduce MSE, and using masking and batch processing to accelerate the algorithm. Experimental results demonstrate that QET can effectively reduce MSE to 5.05%, 13.33%, and 11.89% of the current best method on the LLM dataset, K cache, and V cache, respectively. Our contributions include the abstraction of the QEM problem, the design of the QET algorithm, and the proposal of two optimizations to improve accuracy and speed.
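The abstract outlines the pipeline at a high level; the toy NumPy sketch below illustrates that flow (local element swapping, column-wise quantization, then residual quantization). The pairwise swap rule, group size, and bit-widths are placeholders chosen for illustration, not the paper's exact algorithm.

```python
import numpy as np

def uniform_quantize(x, bits=4):
    """Uniform min-max quantization of an array; returns the dequantized values."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return np.full_like(x, lo)
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

def qet_sketch(M, bits=4, residual_bits=2):
    """Illustrative QET-style pipeline: local reordering, column-wise
    quantization, then quantization of the residual."""
    # 1) Local ordering: swap adjacent elements in each row so that
    #    every consecutive pair is sorted (one pass of the swapping idea).
    ordered = M.copy()
    swap = ordered[:, 0::2] > ordered[:, 1::2]            # which pairs to swap
    a, b = ordered[:, 0::2].copy(), ordered[:, 1::2].copy()
    ordered[:, 0::2] = np.where(swap, b, a)
    ordered[:, 1::2] = np.where(swap, a, b)

    # 2) Quantize column by column (each column is one group here).
    deq = np.stack([uniform_quantize(col, bits) for col in ordered.T], axis=1)

    # 3) Quantize the residual to further reduce MSE.
    residual = ordered - deq
    deq = deq + uniform_quantize(residual, residual_bits)

    # 4) Undo the swaps so the reconstruction matches the original layout.
    recon = deq.copy()
    ra, rb = recon[:, 0::2].copy(), recon[:, 1::2].copy()
    recon[:, 0::2] = np.where(swap, rb, ra)
    recon[:, 1::2] = np.where(swap, ra, rb)
    return recon, swap

M = np.random.randn(8, 16).astype(np.float32)
recon, _ = qet_sketch(M)
print("MSE:", float(np.mean((M - recon) ** 2)))
```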
Related papers
- Reducing QUBO Density by Factoring Out Semi-Symmetries [4.581191399651181]
We introduce the concept of semi-symmetries in QUBO matrices.
We show that our algorithm reduces the number of couplings and circuit depth by up to 45%.
arXiv Detail & Related papers (2024-12-18T12:05:18Z) - Memory-Efficient 4-bit Preconditioned Stochastic Optimization [53.422307389223626]
We introduce 4-bit quantization for Shampoo's preconditioners.
To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners.
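As a rough illustration of the idea (not the paper's optimizer code), the sketch below builds a toy symmetric positive-definite preconditioner, takes its Cholesky factor, and quantizes that factor to 4 bits with a simple absmax quantizer; the matrix sizes and quantizer are assumptions.

```python
import numpy as np

def quantize_4bit(x):
    """Illustrative symmetric 4-bit uniform quantizer (absmax scaling)."""
    scale = np.abs(x).max() / 7.0               # 4-bit signed range: -8..7
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

# Toy "preconditioner": a symmetric positive-definite statistics matrix.
G = np.random.randn(64, 16)
P = G.T @ G + 1e-3 * np.eye(16)

# Quantize the Cholesky factor instead of the dense preconditioner.
L = np.linalg.cholesky(P)
q, scale = quantize_4bit(L)
L_hat = q.astype(np.float64) * scale
P_hat = L_hat @ L_hat.T                         # reconstructed preconditioner stays PSD

print("relative error:", np.linalg.norm(P - P_hat) / np.linalg.norm(P))
```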
arXiv Detail & Related papers (2024-12-14T03:32:54Z) - MVQ:Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization [8.057807176915896]
A novel approach called MVQ is proposed, which aims at better approximating important weights with a limited number of codewords.
Our algorithm is validated on various models for image classification, object detection, and segmentation tasks.
Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by 2.3x and reduces the size of the systolic array by 55% when compared with the base EWS accelerator.
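A minimal sketch of the general idea, assuming a weighted k-means as the fitting procedure and a random saliency map as a stand-in importance score; MVQ's actual masking and codebook construction differ.

```python
import numpy as np

def masked_vq(W, importance, k=16, dim=4, iters=10):
    """Illustrative importance-weighted vector quantization: weights are split
    into sub-vectors, and codewords are fit with a weighted k-means so that
    important weights are approximated more faithfully."""
    vecs = W.reshape(-1, dim)
    w = importance.reshape(-1, dim).mean(axis=1) + 1e-8   # one weight per sub-vector
    rng = np.random.default_rng(0)
    codebook = vecs[rng.choice(len(vecs), k, replace=False)]
    for _ in range(iters):
        d = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for c in range(k):
            m = assign == c
            if m.any():
                # Weighted centroid: important sub-vectors pull codewords toward them.
                codebook[c] = (vecs[m] * w[m, None]).sum(0) / w[m].sum()
    return codebook[assign].reshape(W.shape)

W = np.random.randn(128, 64).astype(np.float32)
importance = np.abs(np.random.randn(*W.shape))            # stand-in saliency scores
W_hat = masked_vq(W, importance)
print("weighted MSE:", float((importance * (W - W_hat) ** 2).mean()))
```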
arXiv Detail & Related papers (2024-12-13T16:30:35Z) - Residual vector quantization for KV cache compression in large language model [2.3094645821058735]
KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding.
In this work, we apply residual vector quantization, which has been widely used for high-fidelity audio compression, to compress the KV cache in large language models (LLMs).
We learn the codebook using an exponential moving average; there are no other learnable parameters, including the input and output projections normally used in a vector quantization setup.
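The following toy sketch shows the general shape of such a residual vector quantizer with EMA codebook updates; stage count, codebook size, and decay are illustrative values, not the paper's configuration.

```python
import numpy as np

class ResidualVQ:
    """Toy multi-stage residual vector quantizer with EMA codebook updates
    and no other learnable parameters (a sketch, not the paper's code)."""
    def __init__(self, stages=2, k=64, dim=8, decay=0.99, seed=0):
        rng = np.random.default_rng(seed)
        self.codebooks = [rng.standard_normal((k, dim)) for _ in range(stages)]
        self.ema_sum = [cb.copy() for cb in self.codebooks]
        self.ema_cnt = [np.ones(k) for _ in self.codebooks]
        self.decay = decay

    def encode(self, x, update=True):
        residual, recon = x.copy(), np.zeros_like(x)
        for s, cb in enumerate(self.codebooks):
            d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
            idx = d.argmin(axis=1)
            chosen = cb[idx]
            if update:  # EMA update of per-codeword sums and counts
                onehot = np.eye(len(cb))[idx]
                self.ema_cnt[s] = self.decay * self.ema_cnt[s] + (1 - self.decay) * onehot.sum(0)
                self.ema_sum[s] = self.decay * self.ema_sum[s] + (1 - self.decay) * onehot.T @ residual
                self.codebooks[s] = self.ema_sum[s] / self.ema_cnt[s][:, None]
            recon += chosen
            residual = residual - chosen      # next stage quantizes what is left
        return recon

# Toy KV-cache slice: 256 vectors of dimension 8.
kv = np.random.randn(256, 8)
rvq = ResidualVQ()
print("MSE:", float(((kv - rvq.encode(kv)) ** 2).mean()))
```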
arXiv Detail & Related papers (2024-10-21T07:20:41Z) - AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations [36.63586957377984]
Due to their massive parameter count, large language models often require substantial storage space.
One research direction proposes to compress the models by replacing floating-point numbers with integers.
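A hypothetical sketch of what a layer-wise asymmetric configuration can look like: keys and values in the same layer get different bit-widths, with some layers pushed down to 1 bit. The specific configuration and quantizer below are assumptions, not AsymKV's.

```python
import numpy as np

def quantize(x, bits):
    """Per-tensor uniform quantizer used for both keys and values (toy version)."""
    if bits == 1:
        return np.sign(x) * np.abs(x).mean()          # 1-bit: sign plus a single scale
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

# Hypothetical layer-wise asymmetric configuration: early layers keep more
# precision for keys, later layers drop keys to 1 bit, values stay at 2 bits.
num_layers = 4
config = [{"k_bits": 4 if layer < 2 else 1, "v_bits": 2} for layer in range(num_layers)]

cache = [{"K": np.random.randn(128, 64), "V": np.random.randn(128, 64)}
         for _ in range(num_layers)]
for layer, (cfg, kv) in enumerate(zip(config, cache)):
    k_hat, v_hat = quantize(kv["K"], cfg["k_bits"]), quantize(kv["V"], cfg["v_bits"])
    print(layer, cfg,
          "K MSE:", float(((kv["K"] - k_hat) ** 2).mean()),
          "V MSE:", float(((kv["V"] - v_hat) ** 2).mean()))
```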
arXiv Detail & Related papers (2024-10-17T04:35:57Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At a batch size of 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
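As a plain NumPy stand-in for what such a kernel computes (FLUTE itself fuses these steps on the GPU), the sketch below dequantizes 4-bit codes through a 16-entry lookup table, applies per-group scales, and runs an ordinary GEMM; the table values, shapes, and layout are placeholders, not FLUTE's format.

```python
import numpy as np

rng = np.random.default_rng(0)
bits, group = 4, 128                                             # 4-bit codes, group size 128
lut = np.linspace(-1.0, 1.0, 2 ** bits).astype(np.float32)      # 16-entry value table

codes = rng.integers(0, 2 ** bits, size=(1024, 1024), dtype=np.uint8)   # quantized weight codes
scales = rng.random((1024, 1024 // group)).astype(np.float32) + 0.5     # per-group scales

def lut_dequant_matmul(x, codes, lut, scales, group):
    # Dequantize by table lookup, rescale per group, then run an ordinary GEMM.
    w = lut[codes] * np.repeat(scales, group, axis=1)
    return x @ w.T

x = rng.standard_normal((32, 1024)).astype(np.float32)           # batch size 32, as in the blurb
print(lut_dequant_matmul(x, codes, lut, scales, group).shape)
```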
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [67.67135738642547]
Post-training quantization (PTQ) is a powerful compression technique that has been widely investigated in large language models (LLMs).
Existing PTQ methods are not ideal in terms of accuracy and efficiency, especially at bit-widths below 4.
This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM.
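A toy sketch of the salience-driven idea, assuming per-group bit allocation where the most salient groups get an extra bit and the least salient give one up; the salience metric and allocation rule here are placeholders, not SliM-LLM's.

```python
import numpy as np

def quantize_group(x, bits):
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / max(hi - lo, 1e-8) * levels) / levels * (hi - lo) + lo

def salience_mixed_precision(W, salience, group=64, budget_bits=3):
    """Toy salience-driven bit allocation: salient groups get more bits,
    less salient groups fewer, keeping the average at the budget."""
    groups = W.reshape(-1, group)
    s = salience.reshape(-1, group).mean(axis=1)
    order = np.argsort(-s)                                 # most salient groups first
    bits = np.full(len(groups), budget_bits)
    top, bottom = order[: len(order) // 4], order[-(len(order) // 4):]
    bits[top], bits[bottom] = budget_bits + 1, budget_bits - 1   # same average bit-width
    out = np.stack([quantize_group(g, b) for g, b in zip(groups, bits)])
    return out.reshape(W.shape)

W = np.random.randn(256, 256).astype(np.float32)
salience = np.abs(W)                                        # stand-in salience metric
W_hat = salience_mixed_precision(W, salience)
print("MSE:", float(((W - W_hat) ** 2).mean()))
```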
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
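The sketch below shows the generic low-bit integer (affine) quantization recipe this line refers to, applied to a toy key-cache tensor; it is not WKVQuant's specific scheme.

```python
import numpy as np

def affine_quantize(x, bits=4):
    """Toy per-tensor affine quantization to unsigned low-bit integers."""
    qmax = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(x / scale + zero_point), 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Quantize a toy key-cache tensor: (heads, sequence, head_dim).
K = np.random.randn(8, 512, 64).astype(np.float32)
q, scale, zp = affine_quantize(K, bits=4)
K_hat = affine_dequantize(q, scale, zp)
print("bytes (fp32 -> packed 4-bit):", K.nbytes, "->", q.size // 2)
print("MSE:", float(((K - K_hat) ** 2).mean()))
```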
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - OneBit: Towards Extremely Low-bit Large Language Models [66.29839811207617]
This paper boldly quantizes the weight matrices of LLMs to 1-bit, paving the way for the extremely low bit-width deployment of LLMs.
Experiments indicate that OneBit achieves good performance (at least 81% of the non-quantized performance on LLaMA models) with robust training processes.
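A hedged sketch of a 1-bit style factorization: a {-1, +1} sign matrix plus two full-precision value vectors fitted to |W| by alternating least squares. This approximates the sign-value decomposition idea but is not OneBit's training procedure.

```python
import numpy as np

def onebit_style_decompose(W, iters=20):
    """Sketch: W is approximated by a sign matrix rescaled by a rank-1
    outer product of two full-precision value vectors."""
    S = np.sign(W)
    S[S == 0] = 1.0
    A = np.abs(W)
    a = A.mean(axis=1)                          # initial row scales
    for _ in range(iters):                      # alternating least squares on |W| ~ a b^T
        b = (A * a[:, None]).sum(0) / (a ** 2).sum()
        a = (A * b[None, :]).sum(1) / (b ** 2).sum()
    return S, a, b

W = np.random.randn(512, 512).astype(np.float32)
S, a, b = onebit_style_decompose(W)
W_hat = S * np.outer(a, b)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```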
arXiv Detail & Related papers (2024-02-17T14:26:57Z) - LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [66.85589263870702]
Our approach uses an iterative algorithm to decompose each pretrained matrix into a high-precision low-rank component and a memory-efficient quantized component.
Experiments on finetuning RoBERTa and LLaMA-2 demonstrate that our low-rank plus quantized matrix decomposition approach (LQ-LoRA) outperforms strong QLoRA and GPTQ-LoRA baselines.
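The iterative decomposition can be sketched directly from this description: alternate between quantizing W minus the current low-rank part and refitting the low-rank part to the quantization residual via a truncated SVD. Bit-width, rank, and the quantizer below are illustrative, not LQ-LoRA's.

```python
import numpy as np

def quantize(x, bits=3):
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

def lowrank_plus_quant(W, rank=16, bits=3, iters=5):
    """Sketch of the iterative decomposition: quantize the residual, then
    refit a high-precision low-rank correction to what quantization missed."""
    L = np.zeros_like(W)
    for _ in range(iters):
        Q = quantize(W - L, bits)                      # memory-efficient quantized part
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]       # high-precision low-rank part
    return Q, L

W = np.random.randn(256, 256).astype(np.float32)
Q, L = lowrank_plus_quant(W)
print("relative error:", np.linalg.norm(W - (Q + L)) / np.linalg.norm(W))
```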
arXiv Detail & Related papers (2023-11-20T18:57:41Z) - A quantum algorithm for solving eigenproblem of the Laplacian matrix of a fully connected weighted graph [4.045204834863644]
We propose an efficient quantum algorithm to solve the eigenproblem of the Laplacian matrix of a fully connected weighted graph.
Specifically, we adopt the optimal Hamiltonian simulation technique based on the block-encoding framework.
We also show that our algorithm can be extended to solve the eigenproblem of the symmetric (and non-symmetric) normalized Laplacian matrix.
arXiv Detail & Related papers (2022-03-28T02:24:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.