Related papers: Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks

URL: http://arxiv.org/abs/2203.05025v1
Date: Wed, 9 Mar 2022 19:57:14 GMT
Title: Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks
Authors: Dominika Przewlocka-Rus, Syed Shakib Sarwar, H. Ekin Sumbul, Yuecheng Li, Barbara De Salvo
Abstract summary: In this paper, we explore non-linear quantization techniques for exploiting lower bit precision. We developed a Quantization Aware Training (QAT) algorithm that allows training of low bit width Power-of-Two (PoT) networks. At the same time, PoT quantization vastly reduces the computational complexity of the neural network.
Score: 1.398698203665363
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deploying Deep Neural Networks on low-power embedded devices for real-time constrained applications requires optimizing the memory footprint and computational complexity of the networks, usually by quantizing the weights. Most existing works employ linear quantization, which causes considerable accuracy degradation for weight bit widths below 8. Since the distribution of weights is usually non-uniform (with most weights concentrated around zero), other methods, such as logarithmic quantization, are more suitable because they preserve the shape of the weight distribution more precisely. Moreover, a base-2 logarithmic representation allows multiplication to be replaced with bit shifting. In this paper, we explore non-linear quantization techniques for exploiting lower bit precision and identify favorable hardware implementation options. We developed a Quantization Aware Training (QAT) algorithm that allows training of low bit width Power-of-Two (PoT) networks and achieves accuracies on par with state-of-the-art floating point models for different tasks. We explored PoT weight encoding techniques and investigated hardware designs of MAC units for three quantization schemes, namely uniform, PoT and Additive-PoT (APoT), to show the increased efficiency of the proposed approach. Overall, the experiments showed that for low bit width precision, non-uniform quantization performs better than uniform quantization, while PoT quantization vastly reduces the computational complexity of the neural network.
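
Illustrative sketch (not taken from the paper): the abstract's point that base-2 logarithmic weights turn multiplication into bit shifting can be shown in a few lines of Python. Everything here is an assumption made for illustration: the function names, the encoding (one sign bit, the remaining bits as an exponent code with one code reserved for zero), rounding in the log domain, weights pre-scaled into [-1, 1], and flushing too-small weights to zero. The authors' QAT algorithm and weight encodings may differ.

```python
import math

def pot_quantize(w, bits=4):
    """Map a weight in [-1, 1] to (sign, shift), meaning sign * 2**(-shift).

    Hypothetical encoding: 1 sign bit, and (bits - 1) bits of exponent code
    with one code reserved for zero. Returns (0, None) for a zero weight.
    """
    max_shift = 2 ** (bits - 1) - 2          # largest representable shift
    if w == 0.0:
        return 0, None
    sign = -1 if w < 0 else 1
    shift = round(-math.log2(abs(w)))        # nearest exponent in the log domain
    if shift > max_shift:
        return 0, None                       # too small: flush to zero
    return sign, max(shift, 0)               # clamp magnitudes above 1 to 2**0

def pot_dequantize(sign, shift):
    """Recover the real value represented by a (sign, shift) pair."""
    return 0.0 if shift is None else sign * 2.0 ** (-shift)

def shift_mac(acts, codes, max_shift):
    """Multiply-accumulate integer activations with PoT weights using shifts.

    The result is fixed point with max_shift fractional bits, so every
    product x * 2**(-shift) becomes the exact left shift x << (max_shift - shift).
    """
    acc = 0
    for x, (sign, shift) in zip(acts, codes):
        if shift is None:
            continue                          # zero weight contributes nothing
        term = x << (max_shift - shift)
        acc = acc + term if sign > 0 else acc - term
    return acc

weights = [0.3, -0.6, 0.01]
codes = [pot_quantize(w, bits=4) for w in weights]
acc = shift_mac([3, 5, 2], codes, max_shift=6)
print(codes, acc / 2 ** 6)
```

With 4 bits this encoding keeps seven magnitudes (2^0 down to 2^-6) plus zero, so the levels cluster near zero where most weights lie, and the accumulator needs only shifts and additions.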

Related papers

MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search [7.564770908909927]
Quantization is a technique for creating efficient Deep Neural Networks (DNNs). We propose MixQuant, a search algorithm that finds the optimal custom quantization bit-width for each layer weight based on roundoff error. We show that combining MixQuant with BRECQ, a state-of-the-art quantization method, yields better quantized model accuracy than BRECQ alone.
arXiv Detail & Related papers (2023-09-29T15:49:54Z)
On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices. For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator. For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
A Practical Mixed Precision Algorithm for Post-Training Quantization [15.391257986051249]
Mixed-precision quantization is a promising solution to find a better performance-efficiency trade-off than homogeneous quantization. We present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset. We show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
arXiv Detail & Related papers (2023-02-10T17:47:54Z)
Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation [0.0]
We show that a hardware neural network accelerator with PoT weights implemented on the Zynq UltraScale+ MPSoC ZCU104 FPGA can be at least $1.4\times$ more energy efficient than the uniform quantisation version.
arXiv Detail & Related papers (2022-09-30T06:33:40Z)
Post-training Quantization for Neural Networks with Provable Guarantees [9.58246628652846]
We modify a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. We prove that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights.
arXiv Detail & Related papers (2022-01-26T18:47:38Z)
Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors. Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
Mixed Precision of Quantization of Transformer Language Models for Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications. Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors. The optimal local precision settings are automatically learned using two techniques. Experiments were conducted on Penn Treebank (PTB) and a Switchboard corpus trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z)
DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs. Existing works either suffer from a severe performance drop in ultra-low precision of 4 or lower bit-widths, or require a heavy fine-tuning process to recover the performance. We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators. We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differential method to search them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
AUSN: Approximately Uniform Quantization by Adaptively Superimposing Non-uniform Distribution for Deep Neural Networks [0.7378164273177589]
Existing uniform and non-uniform quantization methods exhibit an inherent conflict between the representing range and representing resolution. We propose a novel quantization method to quantize the weights and activations. The key idea is to Approximate the Uniform quantization by Adaptively Superposing multiple Non-uniform quantized values, namely AUSN.
arXiv Detail & Related papers (2020-07-08T05:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.