ZeroQ: A Novel Zero Shot Quantization Framework
- URL: http://arxiv.org/abs/2001.00281v1
- Date: Wed, 1 Jan 2020 23:58:26 GMT
- Title: ZeroQ: A Novel Zero Shot Quantization Framework
- Authors: Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney,
Kurt Keutzer
- Abstract summary: Quantization is a promising approach for reducing the inference time and memory footprint of neural networks.
Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance.
Here, we propose ZeroQ, a novel zero-shot quantization framework to address this.
- Score: 83.63606876854168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization is a promising approach for reducing the inference time and
memory footprint of neural networks. However, most existing quantization
methods require access to the original training dataset for retraining during
quantization. This is often not possible for applications with sensitive or
proprietary data, e.g., due to privacy and security concerns. Existing
zero-shot quantization methods use different heuristics to address this, but
they result in poor performance, especially when quantizing to ultra-low
precision. Here, we propose ZeroQ, a novel zero-shot quantization framework to
address this. ZeroQ enables mixed-precision quantization without any access to
the training or validation data. This is achieved by optimizing for a Distilled
Dataset, which is engineered to match the statistics of batch normalization
across different layers of the network. ZeroQ supports both uniform and
mixed-precision quantization. For the latter, we introduce a novel Pareto
frontier based method to automatically determine the mixed-precision bit
setting for all layers, with no manual search involved. We extensively test our
proposed method on a diverse set of models, including ResNet18/50/152,
MobileNetV2, ShuffleNet, SqueezeNext, and InceptionV3 on ImageNet, as well as
RetinaNet-ResNet50 on the Microsoft COCO dataset. In particular, we show that
ZeroQ can achieve 1.71% higher accuracy on MobileNetV2, as compared to the
recently proposed DFQ method. Importantly, ZeroQ has a very low computational
overhead, and it can finish the entire quantization process in less than 30s
(0.5% of one epoch training time of ResNet50 on ImageNet). We have
open-sourced the ZeroQ framework at https://github.com/amirgholami/ZeroQ.
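The core idea above, synthesizing a Distilled Dataset by matching the batch-normalization statistics stored in the pretrained network, can be sketched in a few lines of PyTorch. The following is a minimal illustrative sketch, not the released ZeroQ code; the function name distill_data, the choice of optimizer, and all hyperparameters are assumptions made for illustration.

```python
# Minimal sketch of ZeroQ-style distilled-data generation, assuming a PyTorch
# model with BatchNorm2d layers. Names and hyperparameters are illustrative,
# not the released ZeroQ implementation.
import torch
import torch.nn as nn
import torchvision.models as models

def distill_data(model, batch_size=32, image_size=224, iters=500, lr=0.1):
    model.eval()
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

    # Forward hooks record the mean/variance that the synthetic batch actually
    # produces at the input of every BatchNorm layer.
    batch_stats = {}
    def record(module, inputs, output):
        x = inputs[0]
        batch_stats[module] = (x.mean(dim=(0, 2, 3)), x.var(dim=(0, 2, 3)))
    handles = [bn.register_forward_hook(record) for bn in bn_layers]

    # Start from Gaussian noise and optimize the pixels directly so that the
    # per-layer batch statistics match the stored running statistics.
    x = torch.randn(batch_size, 3, image_size, image_size, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)

    for _ in range(iters):
        batch_stats.clear()
        optimizer.zero_grad()
        model(x)
        loss = x.new_zeros(())
        for bn in bn_layers:
            mu, var = batch_stats[bn]
            loss = loss + (mu - bn.running_mean).pow(2).mean() \
                        + (var - bn.running_var).pow(2).mean()
        loss.backward()
        optimizer.step()

    for h in handles:
        h.remove()
    return x.detach()

# Example: distill a calibration batch from a pretrained ResNet-18 (one of the
# models evaluated in the paper) without touching any real data.
distilled_batch = distill_data(models.resnet18(pretrained=True))
```

The distilled batch can then be used in place of real calibration data, for example to collect activation ranges when choosing quantization scales.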
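For the mixed-precision setting, the abstract describes a Pareto-frontier-based choice of per-layer bit widths. The sketch below only illustrates the general shape of such a search; it assumes a sensitivity score is already available for every layer and candidate bit width (how ZeroQ computes that sensitivity is not spelled out in this abstract), and its exhaustive enumeration is workable only for toy-sized examples, not the paper's actual algorithm.

```python
# Illustrative Pareto-frontier sketch for mixed-precision bit assignment.
# sensitivity[i][b] is an assumed per-layer sensitivity at bit width b;
# param_counts[i] is the number of parameters in layer i.
from itertools import product

def pareto_frontier_bit_settings(sensitivity, param_counts, bit_choices=(2, 4, 8)):
    candidates = []
    for bits in product(bit_choices, repeat=len(param_counts)):
        size = sum(b * n for b, n in zip(bits, param_counts))   # model size in bits
        total_sens = sum(sensitivity[i][b] for i, b in enumerate(bits))
        candidates.append((size, total_sens, bits))

    # Keep only non-dominated points: walking up in model size, keep a
    # configuration only if it strictly lowers the total sensitivity.
    candidates.sort()
    frontier, best_sens = [], float("inf")
    for size, sens, bits in candidates:
        if sens < best_sens:
            frontier.append({"size_bits": size, "sensitivity": sens, "bits": bits})
            best_sens = sens
    return frontier

# Example with made-up numbers: pick the frontier point that fits a model-size
# budget of 4 bits per weight on average.
sens = [{2: 0.9, 4: 0.2, 8: 0.01},
        {2: 0.5, 4: 0.1, 8: 0.01},
        {2: 2.0, 4: 0.4, 8: 0.02}]
params = [10_000, 50_000, 20_000]
frontier = pareto_frontier_bit_settings(sens, params)
budget = 4 * sum(params)
best = max((p for p in frontier if p["size_bits"] <= budget),
           key=lambda p: p["size_bits"])
print(best["bits"], best["sensitivity"])
```

Under this framing, picking a target model size amounts to picking a point on the frontier, which is what allows the bit setting to be chosen automatically with no manual search.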
Related papers
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search [50.07268323597872]
We propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models.
With integer models, we increase the accuracy of ResNet-18 on ImageNet by 1.31% and ResNet-50 by 0.90% with equivalent model cost over previous methods.
For the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% compared to prior state-of-the-art FP8 models.
arXiv Detail & Related papers (2023-08-07T04:17:19Z)
- Genie: Show Me the Data for Quantization [2.7286395031146062]
We introduce a post-training quantization scheme for zero-shot quantization that produces high-quality quantized networks within a few hours.
We also propose a post-training quantization algorithm to enhance the performance of quantized models.
arXiv Detail & Related papers (2022-12-09T11:18:40Z)
- SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation [22.782678826199206]
Quantization of deep neural networks (DNN) has been proven effective for compressing and accelerating models.
Data-free quantization (DFQ) is a promising approach because it does not require the original datasets, which is important in privacy-sensitive and confidential scenarios.
This paper proposes an on-the-fly DFQ framework with sub-second quantization time, called SQuant, which can quantize networks on inference-only devices.
arXiv Detail & Related papers (2022-02-14T01:57:33Z)
- Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast Deployment [15.720551497037176]
We propose an auto-tuner known as Quantune to accelerate the search for the configurations of quantization.
We show that Quantune reduces the search time for quantization by approximately 36.5x with an accuracy loss of 0.07-0.65% across six CNN models.
arXiv Detail & Related papers (2022-02-10T14:05:02Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of the two sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.