Pruning and Quantization for Deep Neural Network Acceleration: A Survey
- URL: http://arxiv.org/abs/2101.09671v2
- Date: Thu, 11 Mar 2021 03:39:39 GMT
- Title: Pruning and Quantization for Deep Neural Network Acceleration: A Survey
- Authors: Tailin Liang, John Glossner, Lei Wang, Shaobo Shi
- Abstract summary: Deep neural networks have been applied in many applications exhibiting extraordinary abilities in the field of computer vision.
Complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs.
This paper provides a survey on two types of network compression: pruning and quantization.
- Score: 2.805723049889524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks have been applied in many applications exhibiting
extraordinary abilities in the field of computer vision. However, complex
network architectures challenge efficient real-time deployment and require
significant computation resources and energy costs. These challenges can be
overcome through optimizations such as network compression. Network compression
can often be realized with little loss of accuracy. In some cases accuracy may
even improve. This paper provides a survey on two types of network compression:
pruning and quantization. Pruning can be categorized as static if it is
performed offline or dynamic if it is performed at run-time. We compare pruning
techniques and describe criteria used to remove redundant computations. We
discuss trade-offs in element-wise, channel-wise, shape-wise, filter-wise,
layer-wise, and even network-wise pruning. Quantization reduces computations by
reducing the precision of the datatype. Weights, biases, and activations are
typically quantized to 8-bit integers, although lower bit-width implementations,
including binary neural networks, are also discussed. Both pruning and
quantization can be used independently or combined. We compare current
techniques, analyze their strengths and weaknesses, present compressed network
accuracy results on a number of frameworks, and provide practical guidance for
compressing networks.
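The pruning granularities named in the abstract can be illustrated with a short sketch. The NumPy snippet below is a minimal, hypothetical example of static element-wise magnitude pruning and of coarser filter-wise (channel) pruning; the tensor shape, the 50% sparsity target, and the function names are illustrative assumptions, not code from the survey.
```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Element-wise pruning: zero out the smallest-magnitude weights."""
    threshold = np.quantile(np.abs(weights), sparsity)  # global magnitude cutoff
    mask = np.abs(weights) > threshold                  # 1 = keep, 0 = prune
    return weights * mask

def filter_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Filter-wise pruning: drop whole output channels ranked by L1 norm."""
    l1 = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    keep = l1 > np.quantile(l1, sparsity)
    return weights[keep]                                # smaller dense tensor

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32, 3, 3))                     # (out_ch, in_ch, kH, kW)
print((magnitude_prune(w, 0.5) == 0).mean())            # ~0.5 of weights are zero
print(filter_prune(w, 0.5).shape)                       # roughly (32, 32, 3, 3)
```
Element-wise pruning keeps the tensor shape and needs sparse kernels to realize a speedup, whereas filter-wise pruning yields a smaller dense tensor that standard hardware can exploit directly.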
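Similarly, the 8-bit quantization described in the abstract can be sketched as a generic affine (scale and zero-point) mapping. This is an illustrative scheme under assumed names, not any particular framework's implementation.
```python
import numpy as np

def quantize_uint8(x: np.ndarray):
    """Affine quantization: map floats to uint8 with a scale and zero point."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Approximate reconstruction of the original floating-point values."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, s, z = quantize_uint8(w)
print(np.abs(w - dequantize(q, s, z)).max())            # error is roughly s / 2
```
Lower bit widths reuse the same idea with a smaller integer range, down to the 1-bit case of binary neural networks.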
Related papers
- A Theoretical Understanding of Neural Network Compression from Sparse
Linear Approximation [37.525277809849776]
The goal of model compression is to reduce the size of a large neural network while retaining a comparable performance.
We use sparsity-sensitive $\ell_q$-norm to characterize compressibility and provide a relationship between soft sparsity of the weights in the network and the degree of compression.
We also develop adaptive algorithms for pruning each neuron in the network informed by our theory.
arXiv Detail & Related papers (2022-06-11T20:10:35Z) - Post-training Quantization for Neural Networks with Provable Guarantees [9.58246628652846]
We modify a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism.
We prove that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights.
arXiv Detail & Related papers (2022-01-26T18:47:38Z) - OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming formulation.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z) - Compact representations of convolutional neural networks via weight
pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while remaining at least as competitive as the baseline.
arXiv Detail & Related papers (2021-08-28T20:39:54Z) - Image Complexity Guided Network Compression for Biomedical Image
Segmentation [5.926887379656135]
We propose an image complexity-guided network compression technique for biomedical image segmentation.
We map the dataset complexity to the target network accuracy degradation caused by compression.
The mapping is used to determine the convolutional layer-wise multiplicative factor for generating a compressed network.
arXiv Detail & Related papers (2021-07-06T22:28:10Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme that uses {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks; a small illustrative sketch follows this list.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - A White Paper on Neural Network Quantization [20.542729144379223]
We introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance.
We consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
arXiv Detail & Related papers (2021-06-15T17:12:42Z) - Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z) - Benchmarking Quantized Neural Networks on FPGAs with FINN [0.42439262432068253]
Using lower precision comes at the cost of only a negligible loss in accuracy.
This article aims to assess the impact of mixed-precision when applied to neural networks deployed on FPGAs.
arXiv Detail & Related papers (2021-02-02T06:42:07Z) - ESPN: Extremely Sparse Pruned Networks [50.436905934791035]
We show that a simple iterative mask discovery method can achieve state-of-the-art compression of very deep networks.
Our algorithm represents a hybrid approach between single shot network pruning methods and Lottery-Ticket type approaches.
arXiv Detail & Related papers (2020-06-28T23:09:27Z) - Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of full-precision networks.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
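The {-1, +1} decomposition summarized above (Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration) can be illustrated with a short, hypothetical NumPy sketch: a weight tensor quantized to odd integer levels with M bits is rewritten as a sum of M binary {-1, +1} tensors weighted by powers of two, so one quantized matrix product becomes M binary products. The 2-bit setting and all names below are illustrative assumptions, not the authors' code.
```python
import numpy as np

def decompose_pm1(q: np.ndarray, bits: int):
    """Split odd integer levels in [-(2**bits - 1), 2**bits - 1] into {-1,+1} branches."""
    u = (q + (2 ** bits - 1)) // 2                         # shift to 0 .. 2**bits - 1
    return [((u >> i) & 1) * 2 - 1 for i in range(bits)]   # bit i -> {-1, +1}

bits = 2
rng = np.random.default_rng(0)
levels = np.arange(-(2 ** bits - 1), 2 ** bits, 2)         # odd grid {-3, -1, 1, 3}
q = rng.choice(levels, size=(4, 4))                        # quantized weights (scale omitted)
branches = decompose_pm1(q, bits)
recon = sum((2 ** i) * b for i, b in enumerate(branches))
assert np.array_equal(recon, q)                            # exact reconstruction
```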