LG-LSQ: Learned Gradient Linear Symmetric Quantization
- URL: http://arxiv.org/abs/2202.09009v1
- Date: Fri, 18 Feb 2022 03:38:12 GMT
- Title: LG-LSQ: Learned Gradient Linear Symmetric Quantization
- Authors: Shih-Ting Lin, Zhaofang Li, Yu-Hsiang Cheng, Hao-Wen Kuo, Chih-Cheng
Lu, Kea-Tiong Tang
- Abstract summary: Deep neural networks with lower precision weights have advantages in terms of the cost of memory space and accelerator power.
The main challenge associated with the quantization algorithm is maintaining accuracy at low bit-widths.
We propose learned gradient linear symmetric quantization (LG-LSQ) as a method for quantizing weights and activation functions to low bit-widths.
- Score: 3.6816597150770387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks with lower precision weights and operations at inference
time have advantages in terms of the cost of memory space and accelerator
power. The main challenge associated with the quantization algorithm is
maintaining accuracy at low bit-widths. We propose learned gradient linear
symmetric quantization (LG-LSQ) as a method for quantizing weights and
activation functions to low bit-widths with high accuracy in integer neural
network processors. First, we introduce the scaling simulated gradient (SSG)
method for determining the appropriate gradient for the scaling factor of the
linear quantizer during the training process. Second, we introduce the
arctangent soft round (ASR) method, which differs from the straight-through
estimator (STE) method in its ability to prevent the gradient from becoming
zero, thereby solving the discrete problem caused by the rounding process.
Finally, to bridge the gap between full-precision and low-bit quantization
networks, we propose the minimize discretization error (MDE) method to
determine an accurate gradient in backpropagation. The ASR+MDE method is a
simple alternative to the STE method and is practical for use in different
uniform quantization methods. In our evaluation, the proposed quantizer
achieved full-precision baseline accuracy in various 3-bit networks, including
ResNet18, ResNet34, and ResNet50, and an accuracy drop of less than 1% in the
quantization of 4-bit weights and 4-bit activations in lightweight models such
as MobileNetV2 and ShuffleNetV2.
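The abstract does not spell out the exact formulas for SSG, ASR, or MDE, but the overall idea of a linear symmetric quantizer whose scaling factor receives gradients through a smooth, arctangent-shaped rounding surrogate can be sketched as follows. This is a minimal PyTorch illustration under assumed definitions; the surrogate used here and the names `AtanSoftRound` and `linear_symmetric_quantize` are not the authors' exact formulation.
```python
import torch


class AtanSoftRound(torch.autograd.Function):
    """Round to the nearest integer in the forward pass, but back-propagate
    through a smooth arctangent-based surrogate so the gradient never collapses
    to zero (an STE-like trick; the paper's exact ASR formula may differ)."""

    @staticmethod
    def forward(ctx, x, alpha=10.0):
        ctx.save_for_backward(x)
        ctx.alpha = alpha
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        alpha = ctx.alpha
        # Derivative of the surrogate
        #   r(x) = floor(x) + 0.5 + atan(alpha * (frac(x) - 0.5)) / (2 * atan(alpha / 2))
        frac = x - torch.floor(x)
        norm = 2.0 * torch.atan(torch.tensor(alpha / 2.0, dtype=x.dtype))
        d = (alpha / norm) / (1.0 + (alpha * (frac - 0.5)) ** 2)
        return grad_output * d, None


def linear_symmetric_quantize(x, scale, num_bits=3):
    """Linear symmetric quantizer with a (learnable) scaling factor."""
    qmax = 2 ** (num_bits - 1) - 1
    q = torch.clamp(x / scale, -qmax, qmax)   # map onto the integer grid range
    q = AtanSoftRound.apply(q)                # discretize, but keep a gradient
    return q * scale                          # map back to the real-valued range


# Toy usage: the scale receives a gradient because rounding no longer blocks it.
w = torch.randn(64, requires_grad=True)
scale = torch.tensor(0.1, requires_grad=True)
loss = linear_symmetric_quantize(w, scale, num_bits=3).pow(2).sum()
loss.backward()
print(scale.grad, w.grad.abs().mean())
```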
Related papers
- Neural Precision Polarization: Simplifying Neural Network Inference with Dual-Level Precision [0.4124847249415279]
A floating-point model can be trained in the cloud and then downloaded to an edge device.
Network weights and activations are directly quantized to the precision level the edge device requires, such as NF4 or INT8.
We show that neural precision polarization achieves approximately 464 TOPS per watt of MAC efficiency while maintaining reliability.
arXiv Detail & Related papers (2024-11-06T16:02:55Z) - Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent.
This approach enables ultra-low-latency and low-power neural networks on hardware capable of performing the arithmetic operations.
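The summary says precision itself is optimized by gradient descent but not how; a generic way to make a bit-width trainable (a sketch of the general idea, not HGQ's actual formulation) is to keep it as a continuous parameter, round it with a straight-through estimator, and add a bit-budget penalty to the loss:
```python
import torch


def ste_round(x: torch.Tensor) -> torch.Tensor:
    # Forward: round to the nearest integer. Backward: identity (straight-through).
    return x + (torch.round(x) - x).detach()


def fake_quantize(w: torch.Tensor, bits: torch.Tensor) -> torch.Tensor:
    """Fake-quantize `w` with a precision that is itself a trainable tensor."""
    b = ste_round(bits).clamp(min=2)                  # effective integer bit-width
    qmax = 2 ** (b - 1) - 1                           # largest representable level
    scale = w.detach().abs().max() / qmax             # symmetric per-tensor scale
    return ste_round((w / scale).clamp(-qmax, qmax)) * scale


# Trainable precision: gradient descent trades task loss against a bit-cost term.
w = torch.randn(128, requires_grad=True)
bits = torch.tensor(6.0, requires_grad=True)
loss = fake_quantize(w, bits).pow(2).mean() + 1e-3 * bits
loss.backward()
print(bits.grad)  # non-zero: the bit-width is optimized like any other parameter
```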
arXiv Detail & Related papers (2024-05-01T17:18:46Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed
Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
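To make the overflow concern concrete: the worst-case magnitude of a dot-product accumulator is fully determined by the operand bit-widths and the dot-product length, so a minimum safe accumulator width can be computed in closed form. The sketch below is the standard worst-case bound, not necessarily the exact criterion used in the paper:
```python
def min_accumulator_bits(weight_bits: int, act_bits: int, dot_length: int) -> int:
    """Smallest signed accumulator width that provably cannot overflow when
    summing `dot_length` products of signed `weight_bits`-bit weights and
    signed `act_bits`-bit activations (worst-case magnitude bound)."""
    max_product = 2 ** (weight_bits - 1) * 2 ** (act_bits - 1)  # largest |w * a|
    max_sum = dot_length * max_product                          # largest |sum of products|
    return max_sum.bit_length() + 1                             # + 1 sign bit


# Example: 4-bit weights, 4-bit activations, and a 512-element dot product need a
# 17-bit accumulator, matching the shorthand b_w + b_a + ceil(log2(N)).
print(min_accumulator_bits(4, 4, 512))  # 17
```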
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - Automatic Network Adaptation for Ultra-Low Uniform-Precision
Quantization [6.1664476076961146]
Uniform-precision neural network quantization has gained popularity because it simplifies the densely packed arithmetic units needed for high computing capability.
However, it ignores the layers' heterogeneous sensitivity to quantization errors, resulting in sub-optimal inference accuracy.
This work proposes a novel neural architecture search called neural channel expansion that adjusts the network structure to alleviate accuracy degradation from ultra-low uniform-precision quantization.
arXiv Detail & Related papers (2022-12-21T09:41:25Z) - Edge Inference with Fully Differentiable Quantized Mixed Precision
Neural Networks [1.131071436917293]
Quantizing parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference.
This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing.
arXiv Detail & Related papers (2022-06-15T18:11:37Z) - Mixed Precision Low-bit Quantization of Neural Network Language Models
for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications.
Current quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of LMs to quantization errors.
Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z) - Mixed Precision of Quantization of Transformer Language Models for
Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying sensitivity of different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments were conducted on the Penn Treebank (PTB) corpus and an LF-MMI TDNN system trained on the Switchboard corpus.
arXiv Detail & Related papers (2021-11-29T09:57:00Z) - Direct Quantization for Training Highly Accurate Low Bit-width Deep
Neural Networks [73.29587731448345]
This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations.
First, to obtain low bit-width weights, most existing methods derive the quantized weights by quantizing the full-precision network weights.
Second, to obtain low bit-width activations, existing works treat all channels equally.
arXiv Detail & Related papers (2020-12-26T15:21:18Z) - DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution
Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer a severe performance drop at ultra-low precision (4 bits or fewer) or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
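As a rough illustration of what a distribution-aware, training-free quantizer can look like, the sketch below derives a per-channel scale directly from each channel's own statistics; the specific choice of k standard deviations is an assumption for illustration, not DAQ's actual scheme:
```python
import torch


def channelwise_quantize(x: torch.Tensor, num_bits: int = 4, k: float = 3.0):
    """Training-free, distribution-aware fake quantization: each channel gets a
    scale derived from its own spread (k standard deviations). Illustrative
    only; DAQ's actual scale selection differs in detail."""
    qmax = 2 ** (num_bits - 1) - 1
    flat = x.reshape(x.shape[0], -1)                    # [channels, elements]
    scale = (k * flat.std(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(x.shape), scale.squeeze(1)


# Example: quantize a random [channels, H, W] activation tensor to 4 bits.
x = torch.randn(8, 16, 16)
x_q, scales = channelwise_quantize(x, num_bits=4)
print((x - x_q).abs().max().item())  # error is ~scale/2 at most inside the clip range
```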
arXiv Detail & Related papers (2020-12-21T10:19:42Z) - A Statistical Framework for Low-bitwidth Training of Deep Neural
Networks [70.77754244060384]
Fully quantized training (FQT) uses low-bitwidth hardware by quantizing the activations, weights, and gradients of a neural network model.
One major challenge with FQT is the lack of theoretical understanding, in particular of how gradient quantization impacts convergence properties.
arXiv Detail & Related papers (2020-10-27T13:57:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.