Transform Quantization for CNN (Convolutional Neural Network)
Compression
- URL: http://arxiv.org/abs/2009.01174v4
- Date: Sun, 7 Nov 2021 17:29:26 GMT
- Title: Transform Quantization for CNN (Convolutional Neural Network)
Compression
- Authors: Sean I. Young, Wang Zhe, David Taubman, and Bernd Girod
- Abstract summary: We optimally transform weights post-training using a rate-distortion framework to improve compression at any given quantization bit-rate.
We show that transform quantization advances the state of the art in CNN compression in both retrained and non-retrained quantization scenarios.
- Score: 26.62351408292294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we compress convolutional neural network (CNN) weights
post-training via transform quantization. Previous CNN quantization techniques
tend to ignore the joint statistics of weights and activations, producing
sub-optimal CNN performance at a given quantization bit-rate, or consider their
joint statistics during training only and do not facilitate efficient
compression of already trained CNN models. We optimally transform (decorrelate)
and quantize the weights post-training using a rate-distortion framework to
improve compression at any given quantization bit-rate. Transform quantization
unifies quantization and dimensionality reduction (decorrelation) techniques in
a single framework to facilitate low bit-rate compression of CNNs and efficient
inference in the transform domain. We first introduce a theory of rate and
distortion for CNN quantization, and pose optimum quantization as a
rate-distortion optimization problem. We then show that this problem can be
solved using optimal bit-depth allocation following decorrelation by the
optimal End-to-end Learned Transform (ELT) we derive in this paper. Experiments
demonstrate that transform quantization advances the state of the art in CNN
compression in both retrained and non-retrained quantization scenarios. In
particular, we find that transform quantization with retraining is able to
compress CNN models such as AlexNet, ResNet and DenseNet to very low bit-rates
(1-2 bits).
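To make the pipeline concrete, below is a minimal NumPy sketch of the two stages the abstract describes: decorrelating a layer's weights with an orthonormal transform and then allocating bit-depths across the transform components by a rate-distortion rule. It is not the paper's End-to-end Learned Transform (ELT); a Karhunen-Loeve-style transform (eigenvectors of the weight covariance) stands in for the decorrelation, and the bit allocation follows the textbook high-rate rule in which each component receives bits in proportion to the log of its variance. The function name, the layer shape, and the target bit-rate are illustrative assumptions only.

```python
# Minimal sketch of post-training transform quantization for one layer,
# assuming NumPy only. The paper's End-to-end Learned Transform (ELT) is
# NOT reproduced here: a KLT-style transform (eigenvectors of the weight
# covariance) stands in as the decorrelating transform, and the bit-depth
# allocation uses the classic high-rate rule
#   b_i = B + 0.5 * log2(var_i / geometric_mean_variance).
import numpy as np

def transform_quantize(W, avg_bits=2.0):
    """Decorrelate and quantize a weight matrix W (out_channels x fan_in)."""
    # 1. Decorrelate the fan-in dimension with an orthonormal transform.
    C = np.cov(W, rowvar=False)                    # fan_in x fan_in covariance
    _, U = np.linalg.eigh(C)                       # eigenvectors, ascending
    U = U[:, ::-1]                                 # strongest components first
    Z = W @ U                                      # transform-domain weights

    # 2. Allocate bit-depths across components (rate-distortion rule).
    var = Z.var(axis=0) + 1e-12
    bits = avg_bits + 0.5 * np.log2(var / np.exp(np.log(var).mean()))
    bits = np.clip(np.round(bits), 0, 8).astype(int)

    # 3. Uniformly quantize each transform component at its bit-depth.
    Z_hat = np.zeros_like(Z)
    for i, b in enumerate(bits):
        if b == 0:
            continue                               # component dropped entirely
        step = (Z[:, i].max() - Z[:, i].min()) / 2 ** b
        Z_hat[:, i] = np.round(Z[:, i] / step) * step

    # 4. Invert the transform to obtain the dequantized weights.
    return Z_hat @ U.T, bits

# Usage: a hypothetical 64x576 conv layer (64 filters, 3x3x64 fan-in) at ~2 bits.
W = np.random.randn(64, 576)
W_hat, bits = transform_quantize(W, avg_bits=2.0)
print("mean bit-depth:", bits.mean(), " reconstruction MSE:", ((W - W_hat) ** 2).mean())
```

In the paper, the transform and the bit-depths are optimized jointly from the rate-distortion objective, and inference can remain in the transform domain; the sketch above only mirrors that structure with standard components.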
Related papers
- GHN-QAT: Training Graph Hypernetworks to Predict Quantization-Robust
Parameters of Unseen Limited Precision Neural Networks [80.29667394618625]
Graph Hypernetworks (GHN) can predict the parameters of varying unseen CNN architectures with surprisingly good accuracy.
Preliminary research has explored the use of GHNs to predict quantization-robust parameters for 8-bit and 4-bit quantized CNNs.
We show that quantization-aware training can significantly improve quantized accuracy for GHN predicted parameters of 4-bit quantized CNNs.
arXiv Detail & Related papers (2023-09-24T23:01:00Z) - Quantization Aware Factorization for Deep Neural Network Compression [20.04951101799232]
Decomposition of convolutional and fully-connected layers is an effective way to reduce the number of parameters and FLOPs in neural networks.
A conventional post-training quantization approach applied to networks with decomposed weights yields a drop in accuracy.
This motivated us to develop an algorithm that finds a decomposed approximation directly with quantized factors.
arXiv Detail & Related papers (2023-08-08T21:38:02Z) - PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization [12.136898590792754]
We analyze the problems of quantization on vision transformers and observe that the activation values after softmax and GELU follow special distributions.
We propose the twin uniform quantization method to reduce the quantization error on these activation values.
Experiments show the quantized vision transformers achieve near-lossless prediction accuracy (less than 0.5% drop at 8-bit quantization) on the ImageNet classification task.
arXiv Detail & Related papers (2021-11-24T06:23:06Z) - Compact representations of convolutional neural networks via weight
pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z) - Fixed-point Quantization of Convolutional Neural Networks for Quantized
Inference on Embedded Platforms [0.9954382983583577]
We propose a method to optimally quantize the weights, biases and activations of each layer of a pre-trained CNN.
We find that layer-wise quantization of parameters significantly helps in this process.
arXiv Detail & Related papers (2021-02-03T17:05:55Z) - Direct Quantization for Training Highly Accurate Low Bit-width Deep
Neural Networks [73.29587731448345]
This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations.
First, to obtain low bit-width weights, most existing methods derive the quantized weights by quantizing the full-precision network weights.
Second, to obtain low bit-width activations, existing works consider all channels equally.
arXiv Detail & Related papers (2020-12-26T15:21:18Z) - Where Should We Begin? A Low-Level Exploration of Weight Initialization
Impact on Quantized Behaviour of Deep Neural Networks [93.4221402881609]
We present an in-depth, fine-grained ablation study of the effect of different weight initializations on the final distributions of weights and activations of different CNN architectures.
To the best of our knowledge, we are the first to perform such a low-level, in-depth quantitative analysis of weight initialization and its effect on quantized behaviour.
arXiv Detail & Related papers (2020-11-30T06:54:28Z) - Permute, Quantize, and Fine-tune: Efficient Compression of Neural
Networks [70.0243910593064]
Key to the success of vector quantization is deciding which parameter groups should be compressed together.
In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function.
We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress.
arXiv Detail & Related papers (2020-10-29T15:47:26Z) - A QP-adaptive Mechanism for CNN-based Filter in Video Coding [26.1307267761763]
This paper presents a generic method to help an arbitrary CNN-filter handle different levels of quantization noise.
When the quantization noise increases, the ability of the CNN-filter to suppress noise improves accordingly.
An additional BD-rate reduction of 0.2% is achieved by our proposed method for chroma components.
arXiv Detail & Related papers (2020-10-25T08:02:38Z) - Exploiting Weight Redundancy in CNNs: Beyond Pruning and Quantization [0.2538209532048866]
Pruning and quantization are proven methods for improving the performance and storage efficiency of convolutional neural networks (CNNs).
We identify another form of redundancy in CNN weight tensors, in the form of repeated patterns of similar values.
arXiv Detail & Related papers (2020-06-22T01:54:04Z) - Gradient $\ell_1$ Regularization for Quantization Robustness [70.39776106458858]
We derive a simple regularization scheme that improves robustness against post-training quantization.
By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths.
arXiv Detail & Related papers (2020-02-18T12:31:34Z)
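As a small illustration of the last entry above, here is a hedged PyTorch sketch of the gradient $\ell_1$ regularization idea: the $\ell_1$ norm of the loss gradient with respect to the weights is added as a penalty so that small quantization perturbations of the weights change the loss less. The model, data, and regularization strength are placeholders rather than the paper's setup, and the paper's exact formulation (for instance, which tensors are regularized) may differ.

```python
# Sketch of gradient l1 regularization for quantization robustness, assuming
# PyTorch. The penalty term requires double backpropagation (create_graph=True).
# Model architecture and lam are hypothetical placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
lam = 0.05  # regularization strength (illustrative value)

def training_step(x, y):
    task_loss = criterion(model(x), y)
    # First backward pass with create_graph=True so that the gradient norm
    # is itself differentiable.
    grads = torch.autograd.grad(task_loss, model.parameters(), create_graph=True)
    grad_l1 = sum(g.abs().sum() for g in grads)
    loss = task_loss + lam * grad_l1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task_loss.item(), grad_l1.item()

# Usage with a dummy batch of 32x32 RGB images and 10 classes.
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
print(training_step(x, y))
```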