RMSMP: A Novel Deep Neural Network Quantization Framework with Row-wise Mixed Schemes and Multiple Precisions
- URL: http://arxiv.org/abs/2111.00153v1
- Date: Sat, 30 Oct 2021 02:53:35 GMT
- Title: RMSMP: A Novel Deep Neural Network Quantization Framework with Row-wise Mixed Schemes and Multiple Precisions
- Authors: Sung-En Chang, Yanyu Li, Mengshu Sun, Weiwen Jiang, Sijia Liu, Yanzhi Wang, Xue Lin
- Abstract summary: This work proposes a novel Deep Neural Network (DNN) quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach.
The proposed RMSMP is tested on image classification and natural language processing (BERT) applications.
It achieves the best accuracy among state-of-the-art methods under the same equivalent precision.
- Score: 43.27226390407956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work proposes a novel Deep Neural Network (DNN) quantization framework,
namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach.
Specifically, this is the first effort to assign mixed quantization schemes and
multiple precisions within layers, i.e., among rows of the DNN weight matrix, for
simplified operations in hardware inference while preserving accuracy.
Furthermore, this paper makes an observation different from prior work: the
quantization error does not necessarily exhibit layer-wise sensitivity, and it
can be mitigated as long as a certain portion of the weights in every layer is
kept at higher precision. This observation enables layer-wise uniformity in the
hardware implementation toward guaranteed inference acceleration, while still
enjoying row-wise flexibility of mixed schemes and multiple precisions to boost
accuracy. The candidate schemes and precisions are derived practically and
effectively with a highly hardware-informative strategy that reduces the problem
search space. With the offline-determined ratio of the different quantization
schemes and precisions for all layers, the RMSMP quantization algorithm uses a
Hessian- and variance-based method to effectively assign a scheme and precision
to each row. The proposed RMSMP is tested on image classification and natural
language processing (BERT) applications and achieves the best accuracy among
state-of-the-art methods under the same equivalent precisions. RMSMP is
implemented on FPGA devices, achieving a 3.65x speedup in end-to-end inference
time for ResNet-18 on ImageNet, compared with a 4-bit fixed-point baseline.
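As a rough illustration of the row-wise assignment step described above, the sketch below assumes a per-row sensitivity score built from a Hessian-trace proxy and the row's weight variance, an offline-chosen fraction of rows kept at higher precision, and uniform symmetric fixed-point quantization as the only candidate scheme. The paper's actual criterion, candidate schemes, and ratios may differ; the names `row_sensitivity`, `quantize_row`, and `rowwise_mixed_precision` are hypothetical.

```python
# Hypothetical sketch of RMSMP-style row-wise precision assignment.
# The exact sensitivity formula, candidate schemes, and ratios are assumptions;
# only uniform symmetric fixed-point quantization is shown as a scheme.
import torch


def row_sensitivity(weight: torch.Tensor, hessian_trace: torch.Tensor) -> torch.Tensor:
    """Score each row of the weight matrix; higher = more sensitive to quantization.
    Combines a per-row Hessian-trace proxy with the row's weight variance (assumed form)."""
    return hessian_trace * weight.var(dim=1)


def quantize_row(row: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fixed-point quantization of a single row."""
    qmax = 2 ** (bits - 1) - 1
    scale = row.abs().max().clamp(min=1e-8) / qmax
    return torch.round(row / scale).clamp(-qmax - 1, qmax) * scale


def rowwise_mixed_precision(weight: torch.Tensor,
                            hessian_trace: torch.Tensor,
                            high_bits: int = 8,
                            low_bits: int = 4,
                            high_ratio: float = 0.3) -> torch.Tensor:
    """Keep the offline-determined fraction `high_ratio` of rows at `high_bits`
    and the remaining rows at `low_bits`, chosen by the sensitivity score."""
    scores = row_sensitivity(weight, hessian_trace)
    n_high = int(high_ratio * weight.shape[0])
    high_rows = set(torch.topk(scores, n_high).indices.tolist())
    out = torch.empty_like(weight)
    for r in range(weight.shape[0]):
        bits = high_bits if r in high_rows else low_bits
        out[r] = quantize_row(weight[r], bits)
    return out
```

With high_ratio=0.3, high_bits=8, and low_bits=4, a layer's weights sit at roughly 0.3*8 + 0.7*4 = 5.2 bits on average, which is one common way to compute the "equivalent precision" used for comparison.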
Related papers
- Mixed-Precision Quantization with Cross-Layer Dependencies [6.338965603383983]
Mixed-precision quantization (MPQ) assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off.
Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently.
We show that this assumption does not reflect the true behavior of quantized deep neural networks.
arXiv Detail & Related papers (2023-07-11T15:56:00Z)
- Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference [3.3213055774512648]
Quantizing networks to lower precision is a powerful technique for simplifying networks.
Mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance.
To estimate the impact of layer precision choice on task performance, two methods are introduced.
Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers.
arXiv Detail & Related papers (2023-01-30T23:26:33Z)
- CSQ: Growing Mixed-Precision Quantization Scheme with Bi-level Continuous Sparsification [51.81850995661478]
Mixed-precision quantization has been widely applied to deep neural networks (DNNs).
Previous attempts on bit-level regularization and pruning-based dynamic precision adjustment during training suffer from noisy gradients and unstable convergence.
We propose Continuous Sparsification Quantization (CSQ), a bit-level training method to search for mixed-precision quantization schemes with improved stability.
arXiv Detail & Related papers (2022-12-06T05:44:21Z)
- Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
- Mixed Precision Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) and a Switchboard-corpus-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- ILMPQ: An Intra-Layer Multi-Precision Deep Neural Network Quantization Framework for FPGA [37.780528948703406]
This work targets the commonly used FPGA (field-programmable gate array) devices as the hardware platform for DNN edge computing.
We use a quantization method that supports multiple precisions along the intra-layer dimension.
We achieve a 3.65x speedup in end-to-end inference time on ImageNet, compared with the fixed-point quantization method.
arXiv Detail & Related papers (2021-10-30T03:02:52Z)
- Fully Quantized Image Super-Resolution Networks [81.75002888152159]
We propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy.
We apply our quantization scheme on multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR.
Our FQSR with low-bit quantization achieves performance on par with its full-precision counterparts on five benchmark datasets.
arXiv Detail & Related papers (2020-11-29T03:53:49Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- Efficient Bitwidth Search for Practical Mixed Precision Neural Network [33.80117489791902]
Network quantization has rapidly become one of the most widely used methods to compress and accelerate deep neural networks.
Recent efforts propose to quantize weights and activations from different layers with different precision to improve the overall performance.
It is challenging to find the optimal bitwidth (i.e., precision) for weights and activations of each layer efficiently.
It is yet unclear how to perform convolution for weights and activations of different precision efficiently on generic hardware platforms.
arXiv Detail & Related papers (2020-03-17T08:27:48Z)
- Post-training Quantization with Multiple Points: Mixed Precision without Mixed Precision [20.081543082708688]
We propose multipoint quantization, a method that approximates a full-precision weight vector using a linear combination of multiple vectors of low-bit numbers (a minimal illustrative sketch of this idea follows the list).
We show that our method outperforms a range of state-of-the-art methods on ImageNet classification and it can be generalized to more challenging tasks like PASCAL VOC object detection.
arXiv Detail & Related papers (2020-02-20T22:37:45Z)
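The multipoint quantization entry above approximates a full-precision weight vector by a linear combination of low-bit vectors. The following is a minimal sketch of that general idea only, using a greedy residual strategy; the bit-width, number of points, and the function `multipoint_approx` are assumptions for illustration, not the cited paper's algorithm.

```python
# Illustrative greedy residual sketch of the multipoint-quantization idea:
# w is approximated as sum_i a_i * q_i with low-bit integer vectors q_i.
# This is an assumption for illustration, not the cited paper's method.
import torch


def multipoint_approx(w: torch.Tensor, bits: int = 2, n_points: int = 3):
    """Return scales a_i and low-bit vectors q_i such that w is approximately sum_i a_i * q_i."""
    qmax = 2 ** (bits - 1) - 1
    residual = w.clone()
    scales, vectors = [], []
    for _ in range(n_points):
        scale = residual.abs().max().clamp(min=1e-8) / qmax
        q = torch.round(residual / scale).clamp(-qmax - 1, qmax)
        scales.append(scale)
        vectors.append(q)
        residual = residual - scale * q  # the next point quantizes what is left
    return scales, vectors
```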
This list is automatically generated from the titles and abstracts of the papers on this site.