Mixed-Precision Quantization with Cross-Layer Dependencies
- URL: http://arxiv.org/abs/2307.05657v1
- Date: Tue, 11 Jul 2023 15:56:00 GMT
- Title: Mixed-Precision Quantization with Cross-Layer Dependencies
- Authors: Zihao Deng, Xin Wang, Sayeh Sharify, Michael Orshansky
- Abstract summary: Mixed-precision quantization (MPQ) assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off.
Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently.
We show that this assumption does not reflect the true behavior of quantized deep neural networks.
- Score: 6.338965603383983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization is commonly used to compress and accelerate deep neural
networks. Quantization that assigns the same bit-width to all layers leads to
large accuracy degradation at low precision and is wasteful at high-precision
settings. Mixed-precision quantization (MPQ) assigns varied bit-widths to
layers to optimize the accuracy-efficiency trade-off. Existing methods simplify
the MPQ problem by assuming that quantization errors at different layers act
independently. We show that this assumption does not reflect the true behavior
of quantized deep neural networks. We propose the first MPQ algorithm that
captures the cross-layer dependency of quantization error. Our algorithm
(CLADO) enables a fast approximation of pairwise cross-layer error terms by
solving linear equations that require only forward evaluations of the network
on a small amount of data. Decisions on layerwise bit-width assignments are
then determined by optimizing a new MPQ formulation dependent on these
cross-layer quantization errors via the Integer Quadratic Program (IQP), which
can be solved within seconds. We conduct experiments on multiple networks on
the ImageNet dataset and demonstrate an improvement, in top-1 classification
accuracy, of up to 27% over uniform precision quantization, and up to 15% over
existing MPQ methods.
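As a rough illustration of the formulation described in the abstract, the sketch below builds a quadratic objective from pairwise cross-layer error terms and picks one bit-width per layer under a size budget. The error matrix, layer sizes, budget, and the brute-force search are illustrative stand-ins only; CLADO itself estimates the pairwise terms from forward passes on calibration data and solves the assignment with an Integer Quadratic Program solver.

```python
import itertools

import numpy as np

# Illustrative setup: 3 layers, each of which may be quantized to 4 or 8 bits.
candidate_bits = [4, 8]
num_layers = 3
num_choices = num_layers * len(candidate_bits)

# delta[i, j] stands for the estimated loss change when choices i and j are
# applied together; the diagonal holds single-choice errors. CLADO estimates
# such terms from forward passes on a small calibration set; random
# placeholders are used here.
rng = np.random.default_rng(0)
delta = rng.normal(scale=0.1, size=(num_choices, num_choices))
delta = (delta + delta.T) / 2  # symmetric pairwise interactions

# Model-size cost of each choice (bits times number of weights, illustrative).
weights_per_layer = np.array([1e6, 2e6, 1e6])
cost = np.array([b * w for w in weights_per_layer for b in candidate_bits])
budget = 24e6  # total bit budget (illustrative)

# Exhaustive search stands in for the Integer Quadratic Program: select exactly
# one bit-width per layer, respect the budget, and minimize x^T delta x.
best_combo, best_obj = None, np.inf
for combo in itertools.product(range(len(candidate_bits)), repeat=num_layers):
    x = np.zeros(num_choices)
    for layer, choice in enumerate(combo):
        x[layer * len(candidate_bits) + choice] = 1.0
    if cost @ x > budget:
        continue
    obj = x @ delta @ x
    if obj < best_obj:
        best_combo, best_obj = combo, obj

print("bit-widths per layer:", [candidate_bits[c] for c in best_combo])
```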
Related papers
- Automatic Network Adaptation for Ultra-Low Uniform-Precision Quantization [6.1664476076961146]
Uniform-precision neural network quantization has gained popularity because it simplifies the densely packed arithmetic units needed for high computing capability.
However, it ignores the heterogeneous sensitivity of different layers to quantization errors, resulting in sub-optimal inference accuracy.
This work proposes a novel neural architecture search called neural channel expansion that adjusts the network structure to alleviate accuracy degradation from ultra-low uniform-precision quantization.
arXiv Detail & Related papers (2022-12-21T09:41:25Z)
- CSMPQ: Class Separability Based Mixed-Precision Quantization [9.005098065862411]
A novel mixed-precision quantization method, termed CSMPQ, is proposed.
Specifically, the TF-IDF metric that is widely used in natural language processing (NLP) is introduced to measure the class separability of layer-wise feature maps.
Without any iterative process, the proposed CSMPQ achieves better compression trade-offs than the state-of-the-art quantization methods.
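For intuition only, the following sketch computes one possible TF-IDF-style separability statistic over pooled, binarized feature maps, treating classes as documents and strongly activated channels as terms. The thresholding, pooling, and the mapping from scores to bit-widths are assumptions for illustration, not CSMPQ's exact construction.

```python
import numpy as np

def layer_separability_tfidf(features, labels, num_classes):
    """TF-IDF-style class-separability score for one layer (illustrative).

    features: (num_samples, num_channels) pooled feature maps
    labels:   (num_samples,) integer class labels
    Classes play the role of documents and strongly activated channels the
    role of terms; CSMPQ's exact statistic may differ from this sketch.
    """
    # A channel "occurs" in a sample if its activation exceeds the channel mean.
    active = features > features.mean(axis=0, keepdims=True)

    # Term frequency: how often each channel fires within each class.
    tf = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        tf[c] = active[labels == c].mean(axis=0)

    # Inverse document frequency: channels typical of few classes score higher.
    df = (tf > 0.5).sum(axis=0)
    idf = np.log(num_classes / (1.0 + df))

    # A higher mean TF-IDF suggests this layer's channels separate classes well.
    return float((tf * idf).mean())

# Toy usage with random pooled features for two hypothetical layers.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=256)
layer_feats = {"layer1": rng.normal(size=(256, 64)),
               "layer2": rng.normal(size=(256, 128))}
scores = {name: layer_separability_tfidf(f, labels, 10)
          for name, f in layer_feats.items()}
print(scores)  # scores like these would rank layers when assigning bit-widths
```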
arXiv Detail & Related papers (2022-12-20T12:52:19Z)
- Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance [50.00102219630088]
In mixed-precision quantization (MPQ), determining the optimal bit-width for each layer is difficult.
We propose a joint training scheme that obtains all layer-wise importance indicators at once.
For example, MPQ search on ResNet18 with our indicators takes only 0.06 seconds.
arXiv Detail & Related papers (2022-03-16T03:23:50Z)
- Post-training Quantization for Neural Networks with Provable Guarantees [9.58246628652846]
We modify a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism.
We prove that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights.
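A simplified rendering of the greedy path-following idea, for a single neuron of a single layer: each weight is quantized in sequence so that the accumulated output error on some sample data stays small. The alphabet, sample data, and absence of the paper's normalization details are illustrative assumptions rather than GPFQ's exact procedure.

```python
import numpy as np

def greedy_quantize_neuron(X, w, alphabet):
    """Quantize one neuron's weights with a greedy path-following pass.

    X: (num_samples, num_weights) inputs seen by the layer
    w: (num_weights,) full-precision weights
    alphabet: 1-D array of representable quantized values
    Each weight is quantized in turn so that the running output error
    u = X[:, :t] @ (w[:t] - q[:t]) stays small.
    """
    q = np.zeros_like(w)
    u = np.zeros(X.shape[0])            # accumulated output error so far
    for t in range(len(w)):
        target = u + w[t] * X[:, t]     # error to cancel at this step
        errs = [np.linalg.norm(target - a * X[:, t]) for a in alphabet]
        q[t] = alphabet[int(np.argmin(errs))]
        u = target - q[t] * X[:, t]
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 32))
w = rng.normal(scale=0.5, size=32)
alphabet = np.linspace(-1.5, 1.5, 2**4)   # illustrative 4-bit alphabet
q = greedy_quantize_neuron(X, w, alphabet)
rel_err = np.linalg.norm(X @ (w - q)) / np.linalg.norm(X @ w)
print(f"relative output error: {rel_err:.3f}")
```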
arXiv Detail & Related papers (2022-01-26T18:47:38Z)
- Mixed Precision Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) corpus and on a Switchboard-corpus-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- RMSMP: A Novel Deep Neural Network Quantization Framework with Row-wise Mixed Schemes and Multiple Precisions [43.27226390407956]
This work proposes a novel Deep Neural Network (DNN) quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach.
The proposed RMSMP is tested for the image classification and natural language processing (BERT) applications.
It achieves the best accuracy among state-of-the-art methods under the same equivalent precision.
arXiv Detail & Related papers (2021-10-30T02:53:35Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer-programming objective.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
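To make the bit-drop idea concrete, here is an illustrative sketch that quantizes a tensor to a few bits and then randomly drops individual bits of the integer codes, the way dropout drops neurons. The uniform quantizer, drop probability, and the absence of any rescaling or gradient handling are simplifications for illustration, not the paper's exact DropBits formulation.

```python
import numpy as np

def quantize_unsigned(x, num_bits):
    """Uniform unsigned quantization to integer codes in [0, 2^b - 1]."""
    scale = x.max() / (2**num_bits - 1)
    codes = np.clip(np.round(x / scale), 0, 2**num_bits - 1).astype(np.int64)
    return codes, scale

def drop_bits(codes, num_bits, drop_prob, rng):
    """Randomly drop individual bits of each code (a dropout on bits)."""
    kept = codes.copy()
    for b in range(num_bits):
        drop_mask = rng.random(codes.shape) < drop_prob   # per-element mask
        kept = np.where(drop_mask, kept & ~(1 << b), kept)
    return kept

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(4, 4)))    # nonnegative activations, e.g. post-ReLU
codes, scale = quantize_unsigned(x, num_bits=4)
dropped = drop_bits(codes, num_bits=4, drop_prob=0.25, rng=rng)
print("codes before bit-drop:\n", codes)
print("codes after bit-drop:\n", dropped)
# Dequantized values would be codes * scale (or dropped * scale during training).
```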
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization [32.770842274996774]
Mixed-precision quantization can potentially achieve the optimal tradeoff between performance and compression rate of deep neural networks.
Previous methods either examine only a small manually-designed search space or utilize a cumbersome neural architecture search to explore the vast search space.
This work proposes bit-level sparsity quantization (BSQ) to tackle mixed-precision quantization from the new angle of inducing bit-level sparsity.
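As a small illustration of the bit-level view, the sketch below decomposes integer weight codes into bit planes and measures a bit-level L1-style density. The decomposition and penalty are generic illustrations of the idea, not BSQ's actual training scheme.

```python
import numpy as np

def to_bit_planes(codes, num_bits):
    """Decompose nonnegative integer codes into a (num_bits, ...) 0/1 array."""
    planes = [((codes >> b) & 1) for b in range(num_bits)]
    return np.stack(planes, axis=0)

rng = np.random.default_rng(0)
num_bits = 8
codes = rng.integers(0, 2**num_bits, size=(64, 64))   # toy quantized weights

bits = to_bit_planes(codes, num_bits)
reconstructed = sum(bits[b].astype(np.int64) << b for b in range(num_bits))
assert np.array_equal(reconstructed, codes)

# Bit-level L1-style density: the fraction of nonzero bits, overall and per plane.
print("overall bit density:", bits.mean())
print("density per bit plane:", bits.mean(axis=(1, 2)))
# Penalizing a density like this during training would drive individual bits to
# zero, which is how a bit-level formulation can shrink some layers' precision.
```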
arXiv Detail & Related papers (2021-02-20T22:37:41Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer a severe performance drop at ultra-low precision (4 bits or fewer) or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
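The sketch below shows one generic way a distribution-aware, training-free quantizer can be written: per-channel statistics of the feature distribution set the quantization range instead of a single global min/max. The mean-and-standard-deviation clipping rule used here is an assumption for illustration; DAQ's actual scheme may derive its ranges differently.

```python
import numpy as np

def distribution_aware_quantize(x, num_bits, num_std=3.0):
    """Quantize features channel-wise using each channel's own statistics.

    x: (channels, height, width) feature map
    The clipping range for each channel comes from its mean and standard
    deviation rather than a single global min/max (illustrative choice).
    """
    levels = 2**num_bits - 1
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True) + 1e-8
    lo, hi = mean - num_std * std, mean + num_std * std

    scale = (hi - lo) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo            # dequantized, training-free

rng = np.random.default_rng(0)
feat = rng.normal(loc=[[[0.0]], [[5.0]]], scale=[[[1.0]], [[0.1]]],
                  size=(2, 8, 8))        # two channels with very different ranges
xq = distribution_aware_quantize(feat, num_bits=4)
err = np.abs(feat - xq).mean(axis=(1, 2))
print("per-channel mean absolute error:", err)
```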
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)