Quantized Sparse Weight Decomposition for Neural Network Compression
- URL: http://arxiv.org/abs/2207.11048v1
- Date: Fri, 22 Jul 2022 12:40:03 GMT
- Title: Quantized Sparse Weight Decomposition for Neural Network Compression
- Authors: Andrey Kuzmin, Mart van Baalen, Markus Nagel, Arash Behboodi
- Abstract summary: We show that this approach can be seen as a unification of weight SVD, vector quantization, and sparse PCA.
Unlike vector quantization, our method is applicable in both the moderate and the extreme compression regime.
- Score: 12.24566619983231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a novel method of neural network weight
compression. In our method, we store weight tensors as sparse, quantized matrix
factors, whose product is computed on the fly during inference to generate the
target model's weights. We use projected gradient descent methods to find
quantized and sparse factorization of the weight tensors. We show that this
approach can be seen as a unification of weight SVD, vector quantization, and
sparse PCA. Combined with end-to-end fine-tuning our method exceeds or is on
par with previous state-of-the-art methods in terms of the trade-off between
accuracy and model size. Our method is applicable to both moderate compression
regimes, unlike vector quantization, and extreme compression regimes.
Related papers
- Diffusion Product Quantization [18.32568431229839]
We explore the quantization of diffusion models in extreme compression regimes to reduce model size while maintaining performance.
We apply our compression method to the DiT model on ImageNet and consistently outperform other quantization approaches.
arXiv Detail & Related papers (2024-11-19T07:47:37Z)
- Convolutional Neural Network Compression Based on Low-Rank Decomposition [3.3295360710329738]
This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization (VBMF).
VBMF is employed to estimate the rank of the weight tensor at each layer.
Experimental results show that the compressed model performs well at both high and low compression ratios (see the low-rank sketch after this list).
arXiv Detail & Related papers (2024-08-29T06:40:34Z)
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
- Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We prune 80% of the parameters while retaining 93.43% of the original performance, without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- Quantization Aware Factorization for Deep Neural Network Compression [20.04951101799232]
Tensor decomposition of convolutional and fully-connected layers is an effective way to reduce parameters and FLOPs in neural networks.
A conventional post-training quantization approach applied to networks with decomposed weights yields a drop in accuracy.
This motivated us to develop an algorithm that finds a decomposed approximation directly with quantized factors.
arXiv Detail & Related papers (2023-08-08T21:38:02Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that effectively alleviates this degeneration (a generic 1-bit QAT sketch follows this list).
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Robust Tensor Principal Component Analysis: Exact Recovery via Deterministic Model [5.414544833902815]
This paper proposes a new method to analyze robust tensor principal component analysis (RTPCA).
It is based on the recently developed tensor-tensor product and tensor singular value decomposition (t-SVD).
arXiv Detail & Related papers (2020-08-05T16:26:10Z)
- Exploiting Weight Redundancy in CNNs: Beyond Pruning and Quantization [0.2538209532048866]
Pruning and quantization are proven methods for improving the performance and storage efficiency of convolutional neural networks (CNNs).
We identify another form of redundancy in CNN weight tensors, in the form of repeated patterns of similar values.
arXiv Detail & Related papers (2020-06-22T01:54:04Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
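For the last entry above (training with quantization noise), a hedged sketch of the core idea under simplifying assumptions: at each forward pass only a random fraction of the weight entries is fake-quantized, so the remaining weights still receive exact gradients while the network learns to tolerate quantization. The block structure and product-quantization codebooks of the actual method are omitted, and the `quant_noise` helper and its parameters are hypothetical.

```python
import torch


def quant_noise(w: torch.Tensor, p: float = 0.1, num_levels: int = 256) -> torch.Tensor:
    """Return weights in which a random fraction p has been fake-quantized (int8-like grid)."""
    scale = w.abs().max() / (num_levels / 2 - 1) + 1e-12
    w_q = torch.clamp(torch.round(w / scale), -num_levels / 2, num_levels / 2 - 1) * scale
    mask = (torch.rand_like(w) < p).float()      # entries that receive quantization noise this step
    # Straight-through: the forward pass uses the noisy mixture, the backward pass sees identity.
    return w + (mask * (w_q - w)).detach()


w = torch.randn(64, 64, requires_grad=True)
y = (quant_noise(w, p=0.2) ** 2).sum()           # stand-in for a training loss
y.backward()
print(w.grad.abs().mean())                       # all weights receive gradients
```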
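For the low-rank decomposition entry above (the one that uses VBMF for rank selection), a hedged sketch of the basic building block: replacing a linear layer with two smaller layers obtained from a truncated SVD. The VBMF rank estimate itself is not reproduced here; `rank` is supplied directly as an assumption.

```python
import torch
import torch.nn as nn


def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a linear layer with a rank-limited factorization W ~ U_r @ V_r."""
    W = layer.weight.data                          # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                   # fold singular values into the left factor
    V_r = Vh[:rank, :]
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)


layer = nn.Linear(1024, 1024)
compressed = low_rank_linear(layer, rank=64)       # 2 * 1024 * 64 weights instead of 1024 * 1024
x = torch.randn(8, 1024)
print((layer(x) - compressed(x)).abs().max())      # error of the rank-64 approximation
```

With in_features = out_features = 1024 and rank 64, the two factors hold about 131k weights instead of roughly 1.05M, an ~8x reduction before any quantization.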
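For the BiTAT entry above, a generic illustration of the baseline it builds on, quantization-aware training with 1-bit weights and a straight-through estimator (STE); this is the standard QAT mechanism the summary refers to, not BiTAT's task-dependent aggregated transformation.

```python
import torch
import torch.nn as nn


class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)                  # 1-bit weights in {-1, +1} (sign(0) -> 0, negligible here)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                    # STE: pass gradients straight through


class BinaryLinear(nn.Linear):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)            # binarize on the fly each forward pass
        return nn.functional.linear(x, w_bin, self.bias)


layer = BinaryLinear(128, 64)
out = layer(torch.randn(4, 128))
out.sum().backward()                          # latent full-precision weights receive gradients
print(layer.weight.grad.shape)
```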