Kernel Quantization for Efficient Network Compression
- URL: http://arxiv.org/abs/2003.05148v1
- Date: Wed, 11 Mar 2020 08:00:04 GMT
- Title: Kernel Quantization for Efficient Network Compression
- Authors: Zhongzhi Yu, Yemin Shi, Tiejun Huang, Yizhou Yu
- Abstract summary: Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels.
Experiments on the ImageNet classification task show that KQ needs only 1.05 and 1.62 bits on average to represent each convolution-layer parameter in VGG and ResNet18, respectively.
- Score: 59.55192551370948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Kernel Quantization (KQ), a novel network compression
framework that aims to efficiently convert any pre-trained full-precision convolutional
neural network (CNN) model into a low-precision version without significant performance
loss. Unlike existing methods that struggle with weight bit-length, KQ improves the
compression ratio by treating the convolution kernel as the quantization unit. Inspired
by the evolution from weight pruning to filter pruning, we propose to quantize at both
the kernel and weight levels. Instead of representing each weight parameter with a
low-bit index, we learn a kernel codebook and replace all kernels in the convolution
layer with their corresponding low-bit indexes. KQ can thus represent the weight tensor
of a convolution layer with low-bit indexes and a kernel codebook of limited size, which
enables a significant compression ratio. We then apply 6-bit parameter quantization to
the kernel codebook to further reduce redundancy. Extensive experiments on the ImageNet
classification task show that KQ needs only 1.05 and 1.62 bits on average to represent
each convolution-layer parameter in VGG and ResNet18, respectively, and achieves
state-of-the-art compression ratios with little accuracy loss.
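
To make the kernel-codebook idea concrete, below is a minimal sketch of kernel-level quantization under stated assumptions: it clusters the 2D kernels of a convolution layer with k-means (scikit-learn) to form a codebook, replaces each kernel with its codebook index, and estimates the resulting average bits per parameter. The codebook size, the use of k-means, and the bit accounting are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of kernel-level quantization with a learned codebook.
# NOT the paper's official implementation; the codebook size, k-means
# clustering, and the bit accounting below are simplifying assumptions.
import numpy as np
from sklearn.cluster import KMeans

def kernel_quantize(weight, codebook_size=256):
    """Cluster the 2D kernels of a conv weight tensor into a codebook.

    weight: array of shape (out_channels, in_channels, kh, kw).
    Returns (codebook, indexes), where each kernel is replaced by one index.
    """
    oc, ic, kh, kw = weight.shape
    kernels = weight.reshape(oc * ic, kh * kw)           # one row per kernel
    kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(kernels)
    codebook = kmeans.cluster_centers_.reshape(codebook_size, kh, kw)
    indexes = kmeans.labels_.reshape(oc, ic)             # low-bit index per kernel
    return codebook, indexes

def dequantize(codebook, indexes):
    """Rebuild an approximate weight tensor from the indexes and the codebook."""
    return codebook[indexes]                             # shape (oc, ic, kh, kw)

def avg_bits_per_param(weight, codebook_size=256, codebook_bits=6):
    """Rough storage estimate: index bits per kernel plus the (further
    quantized) codebook, amortized over all convolution parameters."""
    oc, ic, kh, kw = weight.shape
    index_bits = (oc * ic) * np.ceil(np.log2(codebook_size))
    codebook_storage = codebook_size * kh * kw * codebook_bits
    return (index_bits + codebook_storage) / weight.size

if __name__ == "__main__":
    w = np.random.randn(64, 64, 3, 3).astype(np.float32)   # toy conv layer
    codebook, idx = kernel_quantize(w, codebook_size=256)
    w_hat = dequantize(codebook, idx)
    print("reconstruction MSE:", float(((w - w_hat) ** 2).mean()))
    print("approx. bits/parameter:", avg_bits_per_param(w))
```

Under these toy assumptions, the 64x64x3x3 layer works out to roughly 1.3 bits per parameter (32,768 index bits plus a 6-bit-quantized 256-entry codebook, amortized over 36,864 weights), the same ballpark as the 1.05 to 1.62 bits reported above.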
Related papers
- 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
However, low-bit quantization is known to degrade the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z) - "Lossless" Compression of Deep Neural Networks: A High-dimensional
Neural Tangent Kernel Approach [49.744093838327615]
We provide a novel compression approach to wide and fully-connected deep neural nets.
Experiments on both synthetic and real-world data are conducted to support the advantages of the proposed compression scheme.
arXiv Detail & Related papers (2024-03-01T03:46:28Z) - CADyQ: Content-Aware Dynamic Quantization for Image Super-Resolution [55.50793823060282]
We propose a novel Content-Aware Dynamic Quantization (CADyQ) method for image super-resolution (SR) networks.
CADyQ allocates optimal bits to local regions and layers adaptively based on the local contents of an input image.
The pipeline has been tested on various SR networks and evaluated on several standard benchmarks.
arXiv Detail & Related papers (2022-07-21T07:50:50Z) - A Theoretical Understanding of Neural Network Compression from Sparse
Linear Approximation [37.525277809849776]
The goal of model compression is to reduce the size of a large neural network while retaining comparable performance.
We use a sparsity-sensitive $\ell_q$-norm to characterize compressibility and provide a relationship between the soft sparsity of the weights in the network and the degree of compression.
We also develop adaptive algorithms for pruning each neuron in the network informed by our theory.
arXiv Detail & Related papers (2022-06-11T20:10:35Z) - OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization [32.60139548889592]
We propose a novel One-shot Pruning-Quantization (OPQ) method in this paper.
OPQ analytically solves the compression allocation with pre-trained weight parameters only.
We propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook; see the sketch after this list for a loose illustration of the shared-codebook idea.
arXiv Detail & Related papers (2022-05-23T09:05:25Z) - Compact representations of convolutional neural networks via weight
pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly performs channel pruning and tensor decomposition to compress CNN models.
We achieve a 52.9% FLOPs reduction by removing 48.4% of the parameters of ResNet-50, with only a 0.56% Top-1 accuracy drop on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z) - Automated Model Compression by Jointly Applied Pruning and Quantization [14.824593320721407]
In the traditional deep compression framework, iteratively performing network pruning and quantization can reduce the model size and computation cost.
We tackle this issue by integrating network pruning and quantization into a unified joint compression problem and then using AutoML to solve it automatically.
We propose Automated model compression by Jointly applied Pruning and Quantization (AJPQ).
arXiv Detail & Related papers (2020-11-12T07:06:29Z) - Cross-filter compression for CNN inference acceleration [4.324080238456531]
We propose a new cross-filter compression method that can provide $\sim 32\times$ memory savings and a $122\times$ speed-up in convolution operations.
Our method, built on Binary-Weight and XNOR-Net separately, is evaluated on the CIFAR-10 and ImageNet datasets.
arXiv Detail & Related papers (2020-05-18T19:06:14Z)
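
As a loose companion to the OPQ entry above, the following is a hedged sketch of the general shared-codebook idea, in which every channel of a layer is quantized against one common set of scalar levels. The uniform codebook and nearest-level assignment below are simplifying assumptions for illustration and do not reproduce OPQ's analytic allocation.

```python
# Loose illustration of a layer-wise shared codebook: every channel of a conv
# layer is quantized against the same small set of scalar levels. This is a
# simplified sketch of the general idea, not OPQ's actual algorithm.
import numpy as np

def shared_codebook_quantize(weight, bits=4):
    """Quantize all channels of one layer with a single uniform codebook.

    weight: array of shape (out_channels, in_channels, kh, kw).
    Returns (codebook, indexes) with one low-bit index per weight value.
    """
    levels = 2 ** bits
    w_min, w_max = float(weight.min()), float(weight.max())
    codebook = np.linspace(w_min, w_max, levels)          # common to all channels
    # Nearest-level assignment for every weight in the layer.
    indexes = np.abs(weight[..., None] - codebook).argmin(axis=-1)
    return codebook, indexes

if __name__ == "__main__":
    w = np.random.randn(16, 16, 3, 3).astype(np.float32)
    codebook, idx = shared_codebook_quantize(w, bits=4)
    w_hat = codebook[idx]                                  # dequantize
    print("quantization MSE:", float(((w - w_hat) ** 2).mean()))
```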