HAWQV3: Dyadic Neural Network Quantization
- URL: http://arxiv.org/abs/2011.10680v3
- Date: Wed, 23 Jun 2021 07:49:12 GMT
- Title: HAWQV3: Dyadic Neural Network Quantization
- Authors: Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric
Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, Kurt Keutzer
- Abstract summary: Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
- Score: 73.11579145354801
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current low-precision quantization algorithms often have the hidden cost of
conversion back and forth from floating point to quantized integer values. This
hidden cost limits the latency improvement realized by quantizing Neural
Networks. To address this, we present HAWQV3, a novel mixed-precision
integer-only quantization framework. The contributions of HAWQV3 are the
following: (i) An integer-only inference where the entire computational graph
is performed only with integer multiplication, addition, and bit shifting,
without any floating point operations or even integer division; (ii) A novel
hardware-aware mixed-precision quantization method where the bit-precision is
calculated by solving an integer linear programming problem that balances the
trade-off between model perturbation and other constraints, e.g., memory
footprint and latency; (iii) Direct hardware deployment and open source
contribution for 4-bit uniform/mixed-precision quantization in TVM, achieving
an average speed up of $1.45\times$ for uniform 4-bit, as compared to uniform
8-bit for ResNet50 on T4 GPUs; and (iv) extensive evaluation of the proposed
methods on ResNet18/50 and InceptionV3, for various model compression levels
with/without mixed precision. For ResNet50, our INT8 quantization achieves an
accuracy of $77.58\%$, which is $2.68\%$ higher than prior integer-only work,
and our mixed-precision INT4/8 quantization can reduce INT8 latency by $23\%$
and still achieve $76.73\%$ accuracy. Our framework and the TVM implementation
have been open sourced.
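To make contribution (i) concrete, here is a minimal sketch of dyadic requantization: the real-valued rescaling factor between layers is approximated by a dyadic number $b/2^c$, so mapping an int32 accumulator back onto the int8 grid needs only an integer multiply and a bit shift. This is an illustration written for this summary, not the authors' TVM kernels; the scale values and the choice $c=24$ are hypothetical.

```python
def to_dyadic(scale: float, c: int = 24):
    """Approximate a real requantization scale as a dyadic number b / 2**c
    (b, c integers), so rescaling needs only an integer multiply and a
    right shift (no floating point, no integer division)."""
    b = int(round(scale * (1 << c)))
    return b, c


def requantize(acc32: int, b: int, c: int, qmin: int = -128, qmax: int = 127) -> int:
    """Map an int32 accumulator onto the next layer's int8 grid using
    only integer multiplication and bit shifting, then clamp."""
    q = (acc32 * b) >> c
    return max(qmin, min(qmax, q))


# Toy usage: int8 weights/activations accumulated in int32, then rescaled
# with a dyadic approximation of the (hypothetical) combined scale.
w = [12, -7, 3]                                  # int8 weight values
x = [90, 25, -64]                                # int8 activation values
acc = sum(wi * xi for wi, xi in zip(w, x))       # int32 accumulator
S_w, S_x, S_out = 0.02, 0.05, 0.1                # hypothetical quantization scales
b, c = to_dyadic(S_w * S_x / S_out)              # real scale -> b / 2**c
print(requantize(acc, b, c))                     # int8 output value
```

In an end-to-end integer-only pipeline, $b$ and $c$ would be precomputed offline per layer, so inference itself touches no floating point.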
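Contribution (ii) can be sketched in the same spirit: pick each layer's bit-width to minimize a sensitivity-weighted perturbation subject to a model-size budget. The toy below enumerates {4, 8}-bit assignments for a four-layer model in place of a real ILP solver; the layer names, sensitivities, parameter counts, and the 70% size budget are made-up values, not numbers from the paper.

```python
from itertools import product

# Hypothetical per-layer statistics: `sensitivity` approximates the model
# perturbation incurred by quantizing a layer to 4 bits instead of 8,
# and `params` is the layer's parameter count.
layers = ["conv1", "block1", "block2", "fc"]
sensitivity = {"conv1": 0.9, "block1": 0.3, "block2": 0.2, "fc": 0.1}
params = {"conv1": 9e3, "block1": 2e5, "block2": 1e6, "fc": 2e6}
size_budget = 0.7 * 8 * sum(params.values())      # 70% of the INT8 model size, in bits

best = None
for bits in product([4, 8], repeat=len(layers)):  # brute force stands in for an ILP solver
    size = sum(b * params[name] for b, name in zip(bits, layers))
    if size > size_budget:
        continue                                  # violates the memory constraint
    perturbation = sum(sensitivity[name] for b, name in zip(bits, layers) if b == 4)
    if best is None or perturbation < best[0]:
        best = (perturbation, dict(zip(layers, bits)))

print(best)   # lowest-perturbation bit assignment that fits the budget
```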
Related papers
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models.
With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% at equivalent model cost compared to previous methods.
For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade-off the inference accuracy and speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
arXiv Detail & Related papers (2023-02-24T08:46:01Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization [47.403304754934155]
We present F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication.
Our approach achieves comparable and better performance, when compared with existing quantization techniques.
arXiv Detail & Related papers (2022-02-10T18:48:56Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming problem.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Pareto-Optimal Quantized ResNet Is Mostly 4-bit [3.83996783171716]
We use ResNet as a case study to investigate the effects of quantization on inference compute cost-quality tradeoff curves.
Our results suggest that for each bfloat16 ResNet model, there are quantized models with lower cost and higher accuracy.
We achieve state-of-the-art results on ImageNet for 4-bit ResNet-50 with quantization-aware training, obtaining a top-1 eval accuracy of 77.09%.
arXiv Detail & Related papers (2021-05-07T23:28:37Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [7.886868529510128]
Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors.
Excessive quantization, reducing precision too aggressively, results in accuracy degradation.
Per-vector scale factors can be implemented with low-bitwidth integers when using a two-level quantization scheme (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-02-08T19:56:04Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We find that the accuracy decline is due to activation quantization and replace conventional ReLU with Bounded ReLU to address it.
Our integer networks achieve performance equivalent to the corresponding floating-point networks, but have only 1/4 the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
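As referenced in the VS-Quant entry above, the following is a generic sketch of two-level per-vector scaled quantization: each short vector of weights gets a low-bitwidth integer scale, calibrated against one coarse floating-point scale for the whole tensor. The vector length, bit-widths, and max-based calibration here are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def per_vector_quantize(w: np.ndarray, vec_len: int = 16, bits: int = 4, scale_bits: int = 4):
    """Two-level per-vector scaled quantization (generic sketch): every
    vector of `vec_len` weights gets its own fine-grained scale, and the
    scales themselves are stored as `scale_bits`-bit integers relative to
    a single coarse floating-point scale for the tensor."""
    qmax = 2 ** (bits - 1) - 1                                  # e.g. 7 for signed 4-bit
    vecs = w.reshape(-1, vec_len)
    fine = np.maximum(np.abs(vecs).max(axis=1, keepdims=True) / qmax, 1e-12)
    coarse = fine.max() / (2 ** scale_bits - 1)                 # per-tensor float scale
    s_int = np.clip(np.round(fine / coarse), 1, 2 ** scale_bits - 1)   # low-bit integer scales
    q = np.clip(np.round(vecs / (s_int * coarse)), -qmax - 1, qmax).astype(np.int8)
    return q, s_int.astype(np.uint8), coarse

def per_vector_dequantize(q, s_int, coarse):
    """Reconstruct approximate weights: w ~ q * s_int * coarse."""
    return q.astype(np.float32) * s_int * coarse

# Toy check on random weights: report the worst-case reconstruction error.
w = np.random.randn(8, 16).astype(np.float32)
q, s_int, coarse = per_vector_quantize(w)
print(np.abs(w - per_vector_dequantize(q, s_int, coarse)).max())
```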