Automatic Network Adaptation for Ultra-Low Uniform-Precision
Quantization
- URL: http://arxiv.org/abs/2212.10878v3
- Date: Wed, 29 Mar 2023 07:45:01 GMT
- Title: Automatic Network Adaptation for Ultra-Low Uniform-Precision
Quantization
- Authors: Seongmin Park, Beomseok Kwon, Jieun Lim, Kyuyoung Sim, Tae-Ho Kim and
Jungwook Choi
- Abstract summary: Uniform-precision neural network quantization has gained popularity since it simplifies the densely packed arithmetic units used for high computing capability.
However, it ignores the heterogeneous sensitivity to quantization errors across layers, resulting in sub-optimal inference accuracy.
This work proposes a novel neural architecture search called neural channel expansion that adjusts the network structure to alleviate accuracy degradation from ultra-low uniform-precision quantization.
- Score: 6.1664476076961146
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Uniform-precision neural network quantization has gained popularity since it
simplifies the densely packed arithmetic units used for high computing capability.
However, it ignores the heterogeneous sensitivity to quantization errors across
layers, resulting in sub-optimal inference accuracy. This work proposes a novel
neural architecture search, called neural channel expansion, that adjusts the
network structure to alleviate the accuracy degradation caused by ultra-low
uniform-precision quantization. The proposed method selectively expands channels
for the quantization-sensitive layers while satisfying hardware constraints
(e.g., FLOPs, PARAMs). Based on in-depth analysis and experiments, we demonstrate
that the proposed method can adapt the channels of several popular networks to
achieve superior 2-bit quantization accuracy on CIFAR10 and ImageNet. In
particular, we achieve the best-to-date Top-1/Top-5 accuracy for 2-bit ResNet50
with smaller FLOPs and parameter size.
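To make the idea concrete, here is a minimal, illustrative PyTorch sketch of the core intuition: selectively widen the layers that are most sensitive to 2-bit quantization while staying under a FLOPs budget. The sensitivity proxy, the one-pass greedy loop, and all names are assumptions for illustration only; the paper finds the expansion with a neural architecture search, not this greedy rule.

```python
# Illustrative sketch only (not the paper's NAS): greedily widen the Conv2d layers
# that are most sensitive to 2-bit quantization, subject to a FLOPs budget.
import torch
import torch.nn as nn


def fake_quant(w, bits=2):
    # Symmetric uniform quantization of a weight tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale


def layer_sensitivity(layer, bits=2):
    # Assumed proxy: relative weight-quantization error of the layer.
    w = layer.weight.detach()
    return ((fake_quant(w, bits) - w).pow(2).sum() / w.pow(2).sum()).item()


def expand_plan(model, base_flops, flops_budget, bits=2, step=0.25):
    """Return {layer_name: width multiplier}, widening the most sensitive layers
    first by `step` while total FLOPs stay under `flops_budget`. Cross-layer
    coupling (widening one layer also grows the next layer's input) is ignored
    to keep the sketch short."""
    convs = {n: m for n, m in model.named_modules() if isinstance(m, nn.Conv2d)}
    plan = {n: 1.0 for n in convs}
    used = sum(base_flops.values())
    sens = {n: layer_sensitivity(m, bits) for n, m in convs.items()}
    for name in sorted(sens, key=sens.get, reverse=True):
        extra = base_flops[name] * step
        if used + extra <= flops_budget:
            plan[name] += step
            used += extra
    return plan


# Toy usage (shapes, FLOPs estimates, and the budget are arbitrary).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 32, 3, padding=1))
base_flops = {n: m.out_channels * m.in_channels * 9 * 32 * 32
              for n, m in model.named_modules() if isinstance(m, nn.Conv2d)}
plan = expand_plan(model, base_flops, flops_budget=1.3 * sum(base_flops.values()))
print(plan)  # e.g. {'0': 1.25, '2': 1.25}
```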
Related papers
- Three Quantization Regimes for ReLU Networks [3.823356975862005]
We establish the fundamental limits in the approximation of Lipschitz functions by deep ReLU neural networks with finite-precision weights.
In the proper-quantization regime, neural networks exhibit memory-optimality in the approximation of Lipschitz functions.
arXiv Detail & Related papers (2024-05-03T09:27:31Z)
- Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent (a generic sketch of this idea appears after this list).
This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations.
arXiv Detail & Related papers (2024-05-01T17:18:46Z) - Mixed-Precision Quantization with Cross-Layer Dependencies [6.338965603383983]
Mixed-precision quantization (MPQ) assigns varied bit-widths to layers to optimize the accuracy-efficiency trade-off.
Existing methods simplify the MPQ problem by assuming that quantization errors at different layers act independently.
We show that this assumption does not reflect the true behavior of quantized deep neural networks.
arXiv Detail & Related papers (2023-07-11T15:56:00Z)
- Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference [3.3213055774512648]
Quantizing networks to lower precision is a powerful technique for simplifying networks.
Mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance.
Two methods, EAGL and ALPS, are introduced to estimate the impact of layer precision choice on task performance.
Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers.
arXiv Detail & Related papers (2023-01-30T23:26:33Z)
- Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) corpus and a Switchboard-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
- Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons (a minimal sketch of this idea appears after this list).
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z)
- Direct Quantization for Training Highly Accurate Low Bit-width Deep Neural Networks [73.29587731448345]
This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations.
First, to obtain low bit-width weights, most existing methods obtain the quantized weights by performing quantization on the full-precision network weights.
Second, to obtain low bit-width activations, existing works consider all channels equally.
arXiv Detail & Related papers (2020-12-26T15:21:18Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer a severe performance drop at ultra-low precisions of 4 bits or fewer, or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
The discrete weights of an arbitrary quantized neural network are regarded as searchable variables, and a differentiable method is used to search them accurately (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors [5.609098985493794]
We introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip.
This is crucial for the event selection procedure in proton-proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of $\mathcal{O}(1)\,\mu$s is required.
arXiv Detail & Related papers (2020-06-15T15:07:49Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely low computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features of the original full-precision network to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
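For the "Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip" (HGQ) entry above, the following is a generic PyTorch sketch of gradient-learnable precision: a continuous per-output-channel bit-width (rather than per-weight, for brevity) rounded with a straight-through estimator, plus a simple bit penalty. This is not the HGQ library's API; the class, initializer, and penalty are assumptions for illustration.

```python
# Generic sketch of gradient-learnable precision (NOT the HGQ library's API).
# A continuous per-channel bit-width is rounded in the forward pass; a straight-through
# estimator lets gradients reach it, and a bit penalty in the loss pushes precision down.
import torch
import torch.nn as nn


class LearnableBitLinear(nn.Module):
    def __init__(self, in_features, out_features, init_bits=8.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bits = nn.Parameter(torch.full((out_features, 1), init_bits))  # learnable precision

    def forward(self, x):
        bits = self.bits.clamp(2.0, 8.0)
        bits_int = bits + (bits.round() - bits).detach()       # straight-through rounding of bits
        qmax = 2.0 ** (bits_int - 1) - 1
        scale = self.weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        w = self.weight / scale
        w_ste = w + (w.round() - w).detach()                   # fake-quantize weights with STE
        w_q = torch.minimum(torch.maximum(w_ste, -qmax), qmax) * scale
        return x @ w_q.t()

    def bit_penalty(self):
        # Differentiable proxy for the resource cost of the chosen precisions.
        return self.bits.clamp(2.0, 8.0).mean()


# Usage sketch: the bit penalty is added to the task loss so precision is learned jointly.
layer = LearnableBitLinear(64, 32)
x = torch.randn(4, 64)
loss = layer(x).pow(2).mean() + 1e-3 * layer.bit_penalty()
loss.backward()  # gradients now flow into layer.bits as well as layer.weight
```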
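For the "Cluster-Promoting Quantization with Bit-Drop" entry, here is a minimal sketch of a bit-drop regularizer, assuming per-output-channel coin flips and a simple symmetric quantizer; the granularity, probability, and quantizer are illustrative assumptions, not the paper's exact DropBits scheme.

```python
# Minimal bit-drop sketch (assumptions, not the cited paper's exact scheme):
# during training, each output channel is quantized with one bit less with probability
# drop_p, analogous to dropout applied to bits rather than neurons.
import torch


def quantize(w, bits):
    # Symmetric uniform quantization of a weight tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale


def dropbits(w, bits=4, drop_p=0.2, training=True):
    """Quantize weight tensor `w`; while training, randomly drop one bit per output channel."""
    if not training:
        return quantize(w, bits)
    out = torch.empty_like(w)
    drop = torch.rand(w.shape[0]) < drop_p        # per-output-channel coin flips
    for c in range(w.shape[0]):
        out[c] = quantize(w[c], bits - 1 if bool(drop[c]) else bits)
    return out


# Usage sketch
w = torch.randn(8, 16)
w_q = dropbits(w, bits=4, drop_p=0.25)
```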
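For the "Searching for Low-Bit Weights in Quantized Neural Networks" entry, a minimal sketch of treating each discrete weight as a searchable variable, assuming a softmax relaxation over a fixed set of quantization levels; the level set, temperature, and training details are illustrative, not the paper's exact procedure.

```python
# Minimal sketch: each weight holds a distribution over a fixed set of low-bit values;
# training optimizes the logits, inference takes the argmax level. The relaxation and
# hyperparameters are illustrative, not the cited paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SearchableQuantLinear(nn.Module):
    def __init__(self, in_features, out_features, levels=(-1.0, 0.0, 1.0), temperature=1.0):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels))
        self.logits = nn.Parameter(torch.zeros(out_features, in_features, len(levels)))
        self.scale = nn.Parameter(torch.ones(out_features, 1) * 0.05)  # per-channel scale
        self.temperature = temperature

    def weights(self, hard=False):
        if hard:  # inference: pick the most probable level per weight
            w = self.levels[self.logits.argmax(dim=-1)]
        else:     # training: softmax-weighted (differentiable) combination of levels
            probs = F.softmax(self.logits / self.temperature, dim=-1)
            w = (probs * self.levels).sum(dim=-1)
        return w * self.scale

    def forward(self, x, hard=False):
        return F.linear(x, self.weights(hard))


# Usage sketch
layer = SearchableQuantLinear(16, 8)
x = torch.randn(4, 16)
loss = layer(x).pow(2).mean()
loss.backward()                        # gradients reach the per-weight logits
w_discrete = layer.weights(hard=True)  # ternary weights times a per-channel scale
```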