Quantized Sparse Weight Decomposition for Neural Network Compression
- URL: http://arxiv.org/abs/2207.11048v1
- Date: Fri, 22 Jul 2022 12:40:03 GMT
- Title: Quantized Sparse Weight Decomposition for Neural Network Compression
- Authors: Andrey Kuzmin, Mart van Baalen, Markus Nagel, Arash Behboodi
- Abstract summary: We show that this approach can be seen as a unification of weight SVD, vector quantization, and sparse PCA.
Unlike vector quantization, our method is applicable in both the moderate and the extreme compression regime.
- Score: 12.24566619983231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel method of neural network weight
compression. In our method, we store weight tensors as sparse, quantized matrix
factors, whose product is computed on the fly during inference to generate the
target model's weights. We use projected gradient descent methods to find
quantized and sparse factorization of the weight tensors. We show that this
approach can be seen as a unification of weight SVD, vector quantization, and
sparse PCA. Combined with end-to-end fine-tuning our method exceeds or is on
par with previous state-of-the-art methods in terms of the trade-off between
accuracy and model size. Our method is applicable to both moderate compression
regimes, unlike vector quantization, and extreme compression regimes.
Related papers
- Diffusion Product Quantization [18.32568431229839]
We explore the quantization of diffusion models in extreme compression regimes to reduce model size while maintaining performance.
We apply our compression method to the DiT model on ImageNet and consistently outperform other quantization approaches.
arXiv Detail & Related papers (2024-11-19T07:47:37Z)
- Convolutional Neural Network Compression Based on Low-Rank Decomposition [3.3295360710329738]
This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization (VBMF).
VBMF is employed to estimate the rank of the weight tensor at each layer.
Experimental results show that the compressed model performs well at both high and low compression ratios (see the low-rank sketch after this list).
arXiv Detail & Related papers (2024-08-29T06:40:34Z)
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
- Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We prune 80% of the parameters while retaining 93.43% of the original performance, without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- Quantization Aware Factorization for Deep Neural Network Compression [20.04951101799232]
Tensor decomposition of convolutional and fully-connected layers is an effective way to reduce parameters and FLOPs in neural networks.
A conventional post-training quantization approach applied to networks with decomposed weights yields a drop in accuracy.
This motivated us to develop an algorithm that finds a decomposed approximation directly with quantized factors.
arXiv Detail & Related papers (2023-08-08T21:38:02Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that effectively alleviates this degeneration (a generic 1-bit QAT sketch follows this list).
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Robust Tensor Principal Component Analysis: Exact Recovery via Deterministic Model [5.414544833902815]
This paper proposes a new method to analyze robust tensor principal component analysis (RTPCA).
It is based on the recently developed tensor-tensor product and tensor singular value decomposition (t-SVD).
arXiv Detail & Related papers (2020-08-05T16:26:10Z)
- Exploiting Weight Redundancy in CNNs: Beyond Pruning and Quantization [0.2538209532048866]
Pruning and quantization are proven methods for improving the performance and storage efficiency of convolutional neural networks (CNNs).
We identify another form of redundancy in CNN weight tensors, in the form of repeated patterns of similar values.
arXiv Detail & Related papers (2020-06-22T01:54:04Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
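For the last entry above (training with quantization noise), a hedged sketch of the core idea under simplifying assumptions: at each forward pass only a random fraction of the weight entries is fake-quantized, so the remaining weights still receive exact gradients while the network learns to tolerate quantization. The block structure and product-quantization codebooks of the actual method are omitted, and the `quant_noise` helper and its parameters are hypothetical.

```python
import torch


def quant_noise(w: torch.Tensor, p: float = 0.1, num_levels: int = 256) -> torch.Tensor:
    """Return weights in which a random fraction p has been fake-quantized (int8-like grid)."""
    scale = w.abs().max() / (num_levels / 2 - 1) + 1e-12
    w_q = torch.clamp(torch.round(w / scale), -num_levels / 2, num_levels / 2 - 1) * scale
    mask = (torch.rand_like(w) < p).float()      # entries that receive quantization noise this step
    # Straight-through: the forward pass uses the noisy mixture, the backward pass sees identity.
    return w + (mask * (w_q - w)).detach()


w = torch.randn(64, 64, requires_grad=True)
y = (quant_noise(w, p=0.2) ** 2).sum()           # stand-in for a training loss
y.backward()
print(w.grad.abs().mean())                       # all weights receive gradients
```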
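For the low-rank decomposition entry above (the one that uses VBMF for rank selection), a hedged sketch of the basic building block: replacing a linear layer with two smaller layers obtained from a truncated SVD. The VBMF rank estimate itself is not reproduced here; `rank` is supplied directly as an assumption.

```python
import torch
import torch.nn as nn


def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a linear layer with a rank-limited factorization W ~ U_r @ V_r."""
    W = layer.weight.data                          # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                   # fold singular values into the left factor
    V_r = Vh[:rank, :]
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)


layer = nn.Linear(1024, 1024)
compressed = low_rank_linear(layer, rank=64)       # 2 * 1024 * 64 weights instead of 1024 * 1024
x = torch.randn(8, 1024)
print((layer(x) - compressed(x)).abs().max())      # error of the rank-64 approximation
```

With in_features = out_features = 1024 and rank 64, the two factors hold about 131k weights instead of roughly 1.05M, an ~8x reduction before any quantization.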
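For the BiTAT entry above, a generic illustration of the baseline it builds on, quantization-aware training with 1-bit weights and a straight-through estimator (STE); this is the standard QAT mechanism the summary refers to, not BiTAT's task-dependent aggregated transformation.

```python
import torch
import torch.nn as nn


class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)                  # 1-bit weights in {-1, +1} (sign(0) -> 0, negligible here)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                    # STE: pass gradients straight through


class BinaryLinear(nn.Linear):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)            # binarize on the fly each forward pass
        return nn.functional.linear(x, w_bin, self.bias)


layer = BinaryLinear(128, 64)
out = layer(torch.randn(4, 128))
out.sum().backward()                          # latent full-precision weights receive gradients
print(layer.weight.grad.shape)
```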