Related papers: MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization

URL: http://arxiv.org/abs/2412.10261v2
Date: Mon, 16 Dec 2024 08:54:43 GMT
Title: MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization
Authors: Shuaiting Li, Chengxuan Wang, Juncan Deng, Zeyu Wang, Zewen Ye, Zongsheng Wang, Haibin Shen, Kejie Huang
Abstract summary: A novel approach called MVQ is proposed, which aims at better approximating important weights with a limited number of codewords. Our algorithm is validated on various models for image classification, object detection, and segmentation tasks. Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by 2.3$\times$ and reduces the size of the systolic array by 55% when compared with the base EWS accelerator.
Score: 8.057807176915896
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vector quantization (VQ) is a hardware-friendly DNN compression method that can reduce the storage cost and weight-loading datawidth of hardware accelerators. However, conventional VQ techniques lead to significant accuracy loss because the important weights are not well preserved. To tackle this problem, a novel approach called MVQ is proposed, which aims at better approximating important weights with a limited number of codewords. At the algorithm level, our approach removes the less important weights through N:M pruning and then minimizes the vector clustering error between the remaining weights and codewords using the masked k-means algorithm. Only distances between the unpruned weights and the codewords are computed, and these distances are then used to update the codewords. At the architecture level, our accelerator implements vector quantization on an EWS (Enhanced Weight Stationary) CNN accelerator and introduces a sparse systolic array design to maximize the benefits brought by masked vector quantization. Our algorithm is validated on various models for image classification, object detection, and segmentation tasks. Experimental results demonstrate that MVQ not only outperforms conventional vector quantization methods at comparable compression ratios but also reduces FLOPs. Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by 2.3$\times$ and reduces the size of the systolic array by 55% when compared with the base EWS accelerator. Compared to previous sparse accelerators, MVQ achieves 1.73$\times$ higher energy efficiency.
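The algorithm-level idea in the abstract (N:M pruning followed by masked k-means) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `nm_prune_mask` and `masked_kmeans`, the magnitude-based pruning criterion, the random initialization, and the masked-mean codeword update are all assumptions chosen to match the description that only unpruned weights contribute to distances and codeword updates.

```python
import numpy as np

def nm_prune_mask(weights, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude weights in each group of m.
    (Magnitude is an assumed importance criterion.)"""
    groups = weights.reshape(-1, m)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(weights.shape)

def masked_assign(vectors, m, codebook):
    """Assign each vector to its nearest codeword, measuring squared
    distance only at unpruned positions (m is the 0/1 mask)."""
    diff = vectors[:, None, :] - codebook[None, :, :]
    dist = (m[:, None, :] * diff ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def masked_kmeans(vectors, mask, k=4, iters=20, seed=0):
    """Hypothetical masked k-means: pruned positions are ignored in both
    the assignment step and the codeword update."""
    rng = np.random.default_rng(seed)
    m = mask.astype(vectors.dtype)
    init = rng.choice(len(vectors), size=k, replace=False)
    codebook = vectors[init].copy()
    for _ in range(iters):
        assign = masked_assign(vectors, m, codebook)
        for j in range(k):
            members = assign == j
            count = m[members].sum(axis=0)                  # unpruned entries per dimension
            total = (m[members] * vectors[members]).sum(axis=0)
            # Masked mean per dimension; keep the old value where nothing contributes.
            codebook[j] = np.where(count > 0, total / np.maximum(count, 1), codebook[j])
    return codebook, masked_assign(vectors, m, codebook)

# Toy data: 64 weight vectors of length 8, pruned 2:4, quantized to 4 codewords.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8))
mask = nm_prune_mask(W, n=2, m=4)
codebook, assign = masked_kmeans(W, mask, k=4)
W_hat = codebook[assign] * mask   # reconstruction: codeword lookup, then re-apply the mask
```

In this sketch the reconstruction stores only a codeword index per vector plus the N:M mask; the pruned positions are zeroed after the lookup, which is the property a sparse compute array could exploit.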

Related papers

SAQ: Pushing the Limits of Vector Quantization through Code Adjustment and Dimension Segmentation [13.282924439395204]
Approximate Nearest Neighbor Search (ANNS) plays a critical role in applications such as search engines, recommender systems, and RAG for LLMs. Vector quantization (VQ) is commonly used to reduce space overhead and accelerate distance computations. We propose a novel VQ method called SAQ to balance encoding efficiency and quantization accuracy. We show SAQ achieves up to 80% reduction in quantization error and accelerates encoding speed by over 80x compared to Extended RabitQ.
arXiv Detail & Related papers (2025-09-15T16:14:05Z)
Reducing Storage of Pretrained Neural Networks by Rate-Constrained Quantization and Entropy Coding [56.066799081747845]
The ever-growing size of neural networks poses serious challenges on resource-constrained devices. We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding. Our method allows for very fast decoding and is compatible with arbitrary quantization grids.
arXiv Detail & Related papers (2025-05-24T15:52:49Z)
Automatic mixed precision for optimizing gained time with constrained loss mean-squared-error based on model partition to sequential sub-graphs [0.8999666725996975]
Mixed Precision (MP) mitigates the tradeoff by varying numerical precision across network layers. This study focuses on automatically selecting an optimal MP configuration within Post-Training Quantization (PTQ) for inference.
arXiv Detail & Related papers (2025-05-19T12:51:02Z)
High-Frequency Prior-Driven Adaptive Masking for Accelerating Image Super-Resolution [87.56382172827526]
High-frequency regions are most critical for reconstruction. We propose a training-free adaptive masking module for acceleration. Our method reduces FLOPs by 24-43% for state-of-the-art models.
arXiv Detail & Related papers (2025-05-11T13:18:03Z)
FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs [13.951330786310262]
FineQ is a software-hardware co-design for low-bit fine-grained mixed-precision quantization of large language models. It partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters. It achieves higher model accuracy compared to the SOTA mixed-precision quantization algorithm at a close average bit-width.
arXiv Detail & Related papers (2025-04-28T12:47:23Z)
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
Pyramid Vector Quantization for LLMs [8.779688608449902]
We propose Pyramid Vector Quantization (PVQ) for large language models. PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory. We achieve state-of-the-art quantization performance with a Pareto-optimal trade-off between performance and bits per weight and bits per activation, relative to the methods compared.
arXiv Detail & Related papers (2024-10-22T11:57:32Z)
Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders. We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering [5.363038867793461]
We formulate the Quantization Error Minimization (QEM) problem as minimizing the distance between a matrix before and after quantization. Matrix quantization is crucial in various applications, including Large Language Model (LLM) weight quantization, vector databases, KV cache quantization, graph compression, and image compression. We propose Quantum Entanglement Trees (QET) to address the QEM problem by leveraging the local orderliness of matrix elements.
arXiv Detail & Related papers (2024-07-04T05:13:58Z)
Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval. We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
Post-Training Quantization for Re-parameterization via Coarse & Fine Weight Splitting [13.270381125055275]
We propose a coarse & fine weight splitting (CFWS) method to reduce the quantization error of weights. We develop an improved KL metric to determine optimal quantization scales for activations. For example, the quantized RepVGG-A1 model exhibits a mere 0.3% accuracy loss.
arXiv Detail & Related papers (2023-12-17T02:31:20Z)
BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation. Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration. This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models [9.727062803700264]
We introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication. LUT-GEMM eliminates the resource-intensive dequantization process and reduces computational costs. We show experimentally that when applied to the OPT-175B model with 3-bit quantization, LUT-GEMM substantially reduces token generation latency.
arXiv Detail & Related papers (2022-06-20T03:48:17Z)
AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation. Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization [19.66522714831141]
We develop a new quantization solution called VecQ, which can guarantee minimal direct quantization loss and better model accuracy. In addition, to speed up the proposed quantization process during training, we accelerate it with a parameterized estimation and probability-based calculation.
arXiv Detail & Related papers (2020-05-18T07:38:44Z)
Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss. Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels. Experiments on the ImageNet classification task show that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer.
arXiv Detail & Related papers (2020-03-11T08:00:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.