Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance
- URL: http://arxiv.org/abs/2203.08368v1
- Date: Wed, 16 Mar 2022 03:23:50 GMT
- Title: Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance
- Authors: Chen Tang and Kai Ouyang and Zhi Wang and Yifei Zhu and Yaowei Wang and Wen Ji and Wenwu Zhu
- Abstract summary: The exponentially large search space in mixed-precision quantization (MPQ) makes it hard to determine the optimal bit-width for each layer.
We propose a joint training scheme that can obtain all indicators at once.
For example, MPQ search on ResNet18 with our indicators takes only 0.06 seconds.
- Score: 50.00102219630088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exponentially large discrete search space in mixed-precision quantization
(MPQ) makes it hard to determine the optimal bit-width for each layer. Previous
works usually resort to iterative search methods on the training set, which
consume hundreds or even thousands of GPU-hours. In this study, we reveal that
some unique learnable parameters in quantization, namely the scale factors in
the quantizer, can serve as importance indicators of a layer, reflecting the
contribution of that layer to the final accuracy at certain bit-widths. These
importance indicators naturally perceive the numerical transformation during
quantization-aware training and can therefore provide accurate quantization
sensitivity metrics for each layer. However, a deep network typically
contains hundreds of such indicators, and training them one by one would lead
to an excessive time cost. To overcome this issue, we propose a joint training
scheme that obtains all indicators at once, considerably speeding up indicator
training by parallelizing the originally sequential training processes. With
these learned importance indicators, we formulate the MPQ
search problem as a one-time integer linear programming (ILP) problem, which
avoids iterative search and significantly reduces search time without limiting
the bit-width search space. For example, MPQ search on ResNet18 with our
indicators takes only 0.06 seconds. Extensive experiments further show that our
approach achieves state-of-the-art (SOTA) accuracy on ImageNet across a wide
range of models and under various constraints (e.g., BitOps, compression rate).
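To make the one-shot search step concrete, below is a minimal sketch (not the authors' released code) of how the bit-width assignment can be posed as an ILP: each layer picks exactly one candidate bit-width, the summed learned importance is maximized, and a BitOps-style budget is enforced. The layer count, importance values, costs, budget, and the use of the PuLP solver are all illustrative assumptions.
```python
# Minimal sketch of a one-time ILP for mixed-precision bit-width search,
# assuming hypothetical per-layer importance indicators and per-layer costs.
import pulp

bit_choices = [2, 4, 8]                      # candidate bit-widths
# importance[l][b]: learned indicator of layer l at bit_choices[b] (made-up numbers)
importance = [[0.20, 0.55, 0.90],
              [0.10, 0.50, 0.80],
              [0.30, 0.70, 1.00]]
# cost[l][b]: BitOps-style cost of layer l at bit_choices[b] (made-up numbers)
cost = [[1.0, 2.0, 4.0],
        [2.0, 4.0, 8.0],
        [1.5, 3.0, 6.0]]
budget = 10.0                                # total compute budget

prob = pulp.LpProblem("mpq_search", pulp.LpMaximize)
# x[l][b] == 1 iff layer l is quantized to bit_choices[b]
x = [[pulp.LpVariable(f"x_{l}_{b}", cat="Binary") for b in range(len(bit_choices))]
     for l in range(len(importance))]

# Objective: maximize the total importance of the chosen bit-widths.
prob += pulp.lpSum(importance[l][b] * x[l][b]
                   for l in range(len(importance)) for b in range(len(bit_choices)))
# Each layer must receive exactly one bit-width.
for l in range(len(importance)):
    prob += pulp.lpSum(x[l]) == 1
# The chosen configuration must respect the budget.
prob += pulp.lpSum(cost[l][b] * x[l][b]
                   for l in range(len(importance)) for b in range(len(bit_choices))) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
bit_widths = [bit_choices[b]
              for l in range(len(importance))
              for b in range(len(bit_choices)) if x[l][b].value() == 1]
print("per-layer bit-widths:", bit_widths)
```
Because the importance indicators are learned once during quantization-aware training and the ILP is solved once, the search itself adds essentially no overhead, which is consistent with the 0.06-second ResNet18 figure quoted above.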
Related papers
- Mixed-Precision Quantization with Cross-Layer Dependencies [6.338965603383983]
Mixed-precision quantization (MPQ) assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off.
Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently.
We show that this assumption does not reflect the true behavior of quantized deep neural networks.
arXiv Detail & Related papers (2023-07-11T15:56:00Z)
- Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
arXiv Detail & Related papers (2023-05-31T21:00:50Z)
- Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast Deployment [15.720551497037176]
We propose an auto-tuner known as Quantune to accelerate the search for the configurations of quantization.
We show that Quantune reduces the search time for quantization by approximately 36.5x with an accuracy loss of 0.07 to 0.65% across six CNN models.
arXiv Detail & Related papers (2022-02-10T14:05:02Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- Effective and Fast: A Novel Sequential Single Path Search for Mixed-Precision Quantization [45.22093693422085]
Mixed-precision quantization assigns different bit-precisions to layers according to their sensitivity, which can achieve strong performance.
Quickly determining the bit-precision of each layer of a deep neural network under given constraints is a difficult problem.
We propose a novel sequential single path search (SSPS) method for mixed-precision quantization.
arXiv Detail & Related papers (2021-03-04T09:15:08Z)
- Applications of Koopman Mode Analysis to Neural Networks [52.77024349608834]
We consider the training process of a neural network as a dynamical system acting on the high-dimensional weight space.
We show how the Koopman spectrum can be used to determine the number of layers required for the architecture.
We also show how using Koopman modes we can selectively prune the network to speed up the training procedure.
arXiv Detail & Related papers (2020-06-21T11:00:04Z)
- Post-Training Piecewise Linear Quantization for Deep Neural Networks [13.717228230596167]
Quantization plays an important role in the energy-efficient deployment of deep neural networks on resource-limited devices.
We propose a piecewise linear quantization scheme to enable accurate approximation for tensor values that have bell-shaped distributions with long tails.
Compared to state-of-the-art post-training quantization methods, our proposed method achieves superior performance on image classification, semantic segmentation, and object detection with minor overhead.
arXiv Detail & Related papers (2020-01-31T23:47:00Z)
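As a rough illustration of the piecewise linear quantization idea in the last entry above (a hedged sketch of the general two-region approach, not that paper's exact scheme), the snippet below splits a bell-shaped tensor at a breakpoint and quantizes the dense center and the sparse tails with separate uniform grids. The breakpoint, bit-width, and NumPy implementation are illustrative assumptions.
```python
# Rough sketch of two-region (piecewise linear) quantization for a
# bell-shaped tensor, assuming a hand-picked breakpoint and 4-bit codes.
import numpy as np

def uniform_quantize(x, lo, hi, bits):
    """Uniformly quantize x clipped to [lo, hi] using 2**bits levels."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    q = np.round((np.clip(x, lo, hi) - lo) / step)
    return q * step + lo

def piecewise_quantize(x, breakpoint, bits=4):
    """Quantize the dense center and the sparse tails with separate grids."""
    max_abs = np.abs(x).max()
    center = np.abs(x) <= breakpoint
    out = np.empty_like(x)
    # dense center gets a fine grid over [-breakpoint, breakpoint]
    out[center] = uniform_quantize(x[center], -breakpoint, breakpoint, bits)
    # sparse tails get their own grid over the remaining magnitude range
    out[~center] = np.sign(x[~center]) * uniform_quantize(
        np.abs(x[~center]), breakpoint, max_abs, bits)
    return out

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=10_000)            # bell-shaped weights with tails
pw = piecewise_quantize(w, breakpoint=1.0)
un = uniform_quantize(w, -np.abs(w).max(), np.abs(w).max(), 4)
print("piecewise MSE:", np.mean((w - pw) ** 2))
print("uniform   MSE:", np.mean((w - un) ** 2))
```
On a standard normal tensor, the two-region grid typically yields a noticeably lower reconstruction error than a single uniform grid at the same bit-width, which is the intuition behind handling long-tailed distributions separately.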
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.