Pareto-Optimal Quantized ResNet Is Mostly 4-bit
- URL: http://arxiv.org/abs/2105.03536v1
- Date: Fri, 7 May 2021 23:28:37 GMT
- Title: Pareto-Optimal Quantized ResNet Is Mostly 4-bit
- Authors: AmirAli Abdolrashidi, Lisa Wang, Shivani Agrawal, Jonathan Malmaud,
Oleg Rybakov, Chas Leichner, Lukasz Lew
- Abstract summary: We use ResNet as a case study to investigate the effects of quantization on inference compute cost-quality tradeoff curves.
Our results suggest that for each bfloat16 ResNet model, there are quantized models with lower cost and higher accuracy.
We achieve state-of-the-art results on ImageNet for 4-bit ResNet-50 with quantization-aware training, obtaining a top-1 eval accuracy of 77.09%.
- Score: 3.83996783171716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization has become a popular technique to compress neural networks and
reduce compute cost, but most prior work focuses on studying quantization
without changing the network size. Many real-world applications of neural
networks have compute cost and memory budgets, which can be traded off with
model quality by changing the number of parameters. In this work, we use ResNet
as a case study to systematically investigate the effects of quantization on
inference compute cost-quality tradeoff curves. Our results suggest that for
each bfloat16 ResNet model, there are quantized models with lower cost and
higher accuracy; in other words, the bfloat16 compute cost-quality tradeoff
curve is Pareto-dominated by the 4-bit and 8-bit curves, with models primarily
quantized to 4-bit yielding the best Pareto curve. Furthermore, we achieve
state-of-the-art results on ImageNet for 4-bit ResNet-50 with
quantization-aware training, obtaining a top-1 eval accuracy of 77.09%. We
demonstrate the regularizing effect of quantization by measuring the
generalization gap. The quantization method we used is optimized for
practicality: It requires little tuning and is designed with hardware
capabilities in mind. Our work motivates further research into optimal numeric
formats for quantization, as well as the development of machine learning
accelerators supporting these formats. As part of this work, we contribute a
quantization library written in JAX, which is open-sourced at
https://github.com/google-research/google-research/tree/master/aqt.
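The quantization-aware training behind these results simulates low-precision arithmetic during training while letting gradients flow through a straight-through estimator. The snippet below is a minimal sketch of symmetric per-tensor fake quantization in JAX; the function name, abs-max calibration, and clipping bound are illustrative assumptions, not the AQT library's API (see the repository linked above for the actual implementation).

```python
import jax
import jax.numpy as jnp


def fake_quant(x, bits=4):
    """Symmetric per-tensor fake quantization with a straight-through estimator.

    Illustrative sketch only: quantize x to signed `bits`-bit integers and
    immediately dequantize, so the forward pass sees quantization error while
    gradients flow as if through an identity.
    """
    max_int = 2 ** (bits - 1) - 1                             # e.g. 7 for 4-bit
    scale = jnp.maximum(jnp.max(jnp.abs(x)) / max_int, 1e-8)  # abs-max calibration (assumption)
    x_q = jnp.clip(jnp.round(x / scale), -max_int, max_int) * scale
    return x + jax.lax.stop_gradient(x_q - x)                 # straight-through estimator


# Example: simulate 4-bit weights inside a matmul.
key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (256, 256))
x = jax.random.normal(key, (8, 256))
y = x @ fake_quant(w, bits=4)
```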
Related papers
- ISQuant: apply squant to the real deployment [0.0]
We analyze why the combination of quantization and dequantization is used to train the model.
We propose ISQuant as a solution for deploying 8-bit models.
arXiv Detail & Related papers (2024-07-05T15:10:05Z)
- CEG4N: Counter-Example Guided Neural Network Quantization Refinement [2.722899166098862]
We propose Counter-Example Guided Neural Network Quantization Refinement (CEG4N), a technique that combines search-based quantization and equivalence verification.
We produce models with up to 72% better accuracy than state-of-the-art techniques.
arXiv Detail & Related papers (2022-07-09T09:25:45Z)
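A hedged sketch of the counter-example guided idea summarized above: quantize a layer at a candidate bit-width, check whether the quantized layer stays equivalent to the original, and widen the bit-width when a counter-example appears. The `verify_equivalence` function below is a hypothetical tolerance check on sample inputs, standing in for the formal equivalence verifier CEG4N actually uses.

```python
import jax.numpy as jnp


def quantize_weights(w, bits):
    """Uniform symmetric quantization of a weight matrix (illustrative)."""
    max_int = 2 ** (bits - 1) - 1
    scale = jnp.maximum(jnp.max(jnp.abs(w)) / max_int, 1e-8)
    return jnp.clip(jnp.round(w / scale), -max_int, max_int) * scale


def verify_equivalence(w, w_q, inputs, tol=1e-2):
    """Hypothetical stand-in for formal verification: return a counter-example
    input whose output diverges beyond tol, or None if the layers agree."""
    err = jnp.max(jnp.abs(inputs @ w - inputs @ w_q), axis=1)
    worst = jnp.argmax(err)
    return inputs[worst] if err[worst] > tol else None


def refine_bit_width(w, inputs, start_bits=2, max_bits=8):
    """Search for the smallest bit-width whose quantized layer passes the check."""
    for bits in range(start_bits, max_bits + 1):
        w_q = quantize_weights(w, bits)
        if verify_equivalence(w, w_q, inputs) is None:   # no counter-example found
            return bits, w_q
    return max_bits, quantize_weights(w, max_bits)
```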
- Post-training Quantization for Neural Networks with Provable Guarantees [9.58246628652846]
We modify a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism.
We prove that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights.
arXiv Detail & Related papers (2022-01-26T18:47:38Z)
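Below is a hedged sketch, in JAX, of a greedy error-feedback quantizer in the spirit of the path-following mechanism described above: weights are quantized one at a time, and the output error accumulated on calibration data is fed back into the choice of the next quantized weight. It is our simplified reading of the summary, not the paper's exact GPFQ algorithm.

```python
import jax.numpy as jnp


def greedy_quantize_neuron(w, X, bits=4):
    """Greedily quantize one neuron's weights w (shape [d]) against calibration
    inputs X (shape [n, d]), feeding the running output error back at each step.
    Illustrative sketch of a greedy path-following scheme, not GPFQ itself."""
    max_int = 2 ** (bits - 1) - 1
    scale = jnp.maximum(jnp.max(jnp.abs(w)) / max_int, 1e-8)
    u = jnp.zeros(X.shape[0])          # accumulated output error X @ w - X @ q
    q = []
    for t in range(w.shape[0]):
        x_t = X[:, t]
        # Pick the grid value that best absorbs the error accumulated so far.
        target = (x_t @ (u + w[t] * x_t)) / jnp.maximum(x_t @ x_t, 1e-8)
        q_t = jnp.clip(jnp.round(target / scale), -max_int, max_int) * scale
        u = u + w[t] * x_t - q_t * x_t
        q.append(q_t)
    return jnp.stack(q)
```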
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming problem.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment [36.75157407486302]
We propose a method to train a model for all quantization that supports diverse bit-widths.
We use wavelet decomposition and reconstruction to increase the diversity of weights.
Our method can achieve accuracy comparable to dedicated models trained at the same precision.
arXiv Detail & Related papers (2021-05-04T08:10:50Z)
- Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change as compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
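Dyadic quantization avoids the float conversions mentioned above by expressing every rescaling factor as an integer multiplier over a power of two, so requantizing an int32 accumulator takes only an integer multiply and a right shift. A minimal sketch of that idea (the helper names and the 16-bit shift are assumptions, not HAWQV3's code):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # allow int64 intermediates in this demo


def dyadic_approx(scale, shift=16):
    """Approximate a real rescaling factor as multiplier / 2**shift."""
    return int(round(scale * (1 << shift))), shift


def requantize(acc_int32, scale):
    """Integer-only requantization: multiply by the dyadic multiplier, then shift."""
    m, c = dyadic_approx(scale)
    widened = acc_int32.astype(jnp.int64) * m          # widen to avoid overflow
    return jnp.right_shift(widened, c).astype(jnp.int32)


# Example: rescale an int32 accumulator by 0.0123 without any float math at inference.
acc = jnp.array([12345, -6789], dtype=jnp.int32)
print(requantize(acc, 0.0123))
```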
- Subtensor Quantization for Mobilenets [5.735035463793008]
Quantization for deep neural networks (DNNs) has enabled developers to deploy models with less memory and more efficient low-power inference.
In this paper, we analyzed several root causes of quantization loss and proposed alternatives that do not rely on per-channel or training-aware approaches.
We evaluate the image classification task on the ImageNet dataset, and our post-training quantized 8-bit inference top-1 accuracy is within 0.7% of the floating-point version.
arXiv Detail & Related papers (2020-11-04T15:41:47Z)
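For context on the entry above, here is a hedged sketch of the plain per-tensor, post-training 8-bit scheme such work starts from: an affine (scale plus zero-point) mapping calibrated from observed min/max values. The paper's subtensor partitioning is not reproduced here.

```python
import jax.numpy as jnp


def affine_quantize_uint8(x):
    """Per-tensor asymmetric 8-bit quantization from observed min/max (baseline PTQ)."""
    x_min, x_max = jnp.min(x), jnp.max(x)
    scale = jnp.maximum((x_max - x_min) / 255.0, 1e-8)
    zero_point = jnp.clip(jnp.round(-x_min / scale), 0, 255)
    q = jnp.clip(jnp.round(x / scale) + zero_point, 0, 255).astype(jnp.uint8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map uint8 codes back to floats for accuracy checks."""
    return (q.astype(jnp.float32) - zero_point) * scale
```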
- Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of both.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
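The joint training summarized above hinges on a single quantization step size shared across bit-widths, so lower-bit models can later be derived from the jointly trained one. A minimal sketch of what sharing the step size means (the bit-widths and step value are illustrative assumptions):

```python
import jax.numpy as jnp


def quantize_shared_step(w, step, bits):
    """Uniform symmetric quantization of w; only the clipping range depends on bits."""
    max_int = 2 ** (bits - 1) - 1
    return jnp.clip(jnp.round(w / step), -max_int, max_int) * step


# One shared step size serves every bit-width; lower bits simply clip more aggressively.
w = jnp.linspace(-1.0, 1.0, 9)
step = 0.05
models = {bits: quantize_shared_step(w, step, bits) for bits in (2, 3, 4)}
```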
- Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound of 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z)
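The reinforcement-learning search above is driven by a hard weight-memory constraint. The snippet below sketches only that constraint, not the agent: given per-layer parameter counts and a candidate bit-width per layer (from 2, 4, or 8), it checks whether the weight-only footprint fits the 2 MB budget. The example layer sizes are made up.

```python
def weight_memory_bytes(param_counts, bitwidths):
    """Weight-only footprint of a per-layer bit-width assignment, in bytes."""
    return sum(n * b for n, b in zip(param_counts, bitwidths)) / 8


def fits_mcu_budget(param_counts, bitwidths, budget_bytes=2 * 1024 * 1024):
    """True if the mixed-precision assignment fits the 2 MB MCU weight budget."""
    return weight_memory_bytes(param_counts, bitwidths) <= budget_bytes


# Example: three layers with bit-widths drawn from {2, 4, 8}.
params = [300_000, 1_200_000, 2_500_000]
print(fits_mcu_budget(params, [8, 4, 2]))   # ~1.45 MB of weights -> True
```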
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.