Cross-filter compression for CNN inference acceleration
- URL: http://arxiv.org/abs/2005.09034v1
- Date: Mon, 18 May 2020 19:06:14 GMT
- Title: Cross-filter compression for CNN inference acceleration
- Authors: Fuyuan Lyu, Shien Zhu, Weichen Liu
- Abstract summary: We propose a new cross-filter compression method that can provide $\sim32\times$ memory savings and $122\times$ speed up in convolution operations.
Our method, based on Binary-Weight and XNOR-Net separately, is evaluated on the CIFAR-10 and ImageNet datasets.
- Score: 4.324080238456531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks demonstrate great capability on multiple tasks,
such as image classification and many others. However, substantial resources are
required to train a network. Hence much effort has been made to accelerate
neural networks by reducing the precision of weights, activations, and gradients.
However, these filter-wise quantization methods have a natural upper limit,
caused by the size of the kernel. Meanwhile, with the popularity of small
kernels, this natural limit decreases further. To address this issue, we propose a
new cross-filter compression method that can provide $\sim32\times$ memory
savings and $122\times$ speed up in convolution operations. In our method, all
convolution filters are quantized to a given number of bits and spatially adjacent
filters share the same scaling factor. Our compression method, based on Binary-Weight
and XNOR-Net separately, is evaluated on the CIFAR-10 and ImageNet datasets with
widely used network structures, such as ResNet and VGG, and shows tolerable
accuracy loss compared to state-of-the-art quantization methods.
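The idea of sharing one scaling factor across several binarized filters can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the grouping of consecutive filters in a flat list, the mean-absolute-value scale (in the style of Binary-Weight networks), and the bit accounting (1 bit per weight plus one 32-bit scale per group) are all assumptions made for illustration.

```python
def binarize_shared_scale(filters, group_size):
    """Binarize filters so that each group of adjacent filters shares one scale.

    filters: list of flattened filters, each a list of floats (hypothetical layout).
    Returns (signs, scales): signs mirror `filters` with values in {-1, +1};
    scales holds one float per group of `group_size` consecutive filters.
    """
    if group_size <= 0 or len(filters) % group_size != 0:
        raise ValueError("number of filters must be a multiple of group_size")
    # Sign of every weight (zero is mapped to +1).
    signs = [[1 if w >= 0 else -1 for w in f] for f in filters]
    scales = []
    for start in range(0, len(filters), group_size):
        group = filters[start:start + group_size]
        # Assumed scale: mean absolute value over all weights in the group.
        total = sum(abs(w) for f in group for w in f)
        count = sum(len(f) for f in group)
        scales.append(total / count)
    return signs, scales


def dequantize(signs, scales, group_size):
    """Rebuild approximate weights: sign times the scale of the filter's group."""
    return [[s * scales[i // group_size] for s in f] for i, f in enumerate(signs)]


def compression_ratio(weights_per_filter, group_size, float_bits=32):
    """Ratio of full-precision bits to (1 bit per weight + one shared scale)."""
    n = weights_per_filter * group_size
    return (float_bits * n) / (n + float_bits)
```

Under these assumptions, one scale per 3x3 kernel (`compression_ratio(9, 1)`) gives a ratio of about 7.0, while sharing one scale across 64 such kernels (`compression_ratio(9, 64)`) gives about 30.3, which approaches the 32x bound for 1-bit weights and is consistent with the abstract's point that the kernel size caps filter-wise methods.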
Related papers
- RedBit: An End-to-End Flexible Framework for Evaluating the Accuracy of Quantized CNNs [9.807687918954763]
Convolutional Neural Networks (CNNs) have become the standard class of deep neural network for image processing, classification and segmentation tasks.
RedBit is an open-source framework that provides a transparent, easy-to-use interface to evaluate the effectiveness of different algorithms on network accuracy.
arXiv Detail & Related papers (2023-01-15T21:27:35Z)
- Approximating Continuous Convolutions for Deep Network Compression [11.566258236184964]
We present ApproxConv, a novel method for compressing the layers of a convolutional neural network.
We show that our method is able to compress existing deep network models by half whilst losing only 1.86% accuracy.
arXiv Detail & Related papers (2022-10-17T11:41:26Z)
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks [70.0243910593064]
Key to success of vector quantization is deciding which parameter groups should be compressed together.
In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function.
We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress.
arXiv Detail & Related papers (2020-10-29T15:47:26Z)
- Compressing Deep Convolutional Neural Networks by Stacking Low-dimensional Binary Convolution Filters [15.66437882635872]
Deep Convolutional Neural Networks (CNN) have been successfully applied to many real-life problems.
Huge memory cost of deep CNN models poses a great challenge of deploying them on memory-constrained devices.
We propose a novel method to compress deep CNN model by stacking low-dimensional binary convolution filters.
arXiv Detail & Related papers (2020-10-06T14:49:22Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace conventional ReLU with Bounded ReLU and find that the decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding FPN networks, but have only 1/4 of the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
- Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize in both kernel and weight level.
Experiments on the ImageNet classification task prove that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer.
arXiv Detail & Related papers (2020-03-11T08:00:04Z)
- Computational optimization of convolutional neural networks using separated filters architecture [69.73393478582027]
We consider a convolutional neural network transformation that reduces computational complexity and thus speeds up neural network processing.
Use of convolutional neural networks (CNN) is the standard approach to image recognition despite the fact they can be too computationally demanding.
arXiv Detail & Related papers (2020-02-18T17:42:13Z)
- Pruning CNN's with linear filter ensembles [0.0]
We use pruning to reduce the network size and, implicitly, the number of floating point operations (FLOPs).
We develop a novel filter importance norm that is based on the change in the empirical loss caused by the presence or removal of a component from the network architecture.
We evaluate our method on a fully connected network, as well as on the ResNet architecture trained on the CIFAR-10 dataset.
arXiv Detail & Related papers (2020-01-22T16:52:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.