A Survey of Quantization Methods for Efficient Neural Network Inference
- URL: http://arxiv.org/abs/2103.13630v1
- Date: Thu, 25 Mar 2021 06:57:11 GMT
- Title: A Survey of Quantization Methods for Efficient Neural Network Inference
- Authors: Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney,
Kurt Keutzer
- Abstract summary: Quantization is the problem of distributing continuous real-valued numbers over a fixed discrete set of numbers so as to minimize the number of bits required while preserving the accuracy of the attendant computations.
It has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas.
Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x.
- Score: 75.55159744950859
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As soon as abstract mathematical computations were adapted to computation on
digital computers, the problem of efficient representation, manipulation, and
communication of the numerical values in those computations arose. Strongly
related to the problem of numerical representation is the problem of
quantization: in what manner should a set of continuous real-valued numbers be
distributed over a fixed discrete set of numbers to minimize the number of bits
required and also to maximize the accuracy of the attendant computations? This
perennial problem of quantization is particularly relevant whenever memory
and/or computational resources are severely restricted, and it has come to the
forefront in recent years due to the remarkable performance of Neural Network
models in computer vision, natural language processing, and related areas.
Moving from floating-point representations to low-precision fixed integer
values represented in four bits or less holds the potential to reduce the
memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x
to 8x are often realized in practice in these applications. Thus, it is not
surprising that quantization has emerged recently as an important and very
active sub-area of research in the efficient implementation of computations
associated with Neural Networks. In this article, we survey approaches to the
problem of quantizing the numerical values in deep Neural Network computations,
covering the advantages/disadvantages of current methods. With this survey and
its organization, we hope to have presented a useful snapshot of the current
research in quantization for Neural Networks and to have given an intelligent
organization to ease the evaluation of future research in this area.
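As a concrete illustration of the mapping described in the abstract, the sketch below performs symmetric uniform quantization of a floating-point tensor to 8-bit integers and dequantizes it back. The NumPy-based helper names and the per-tensor scaling choice are illustrative assumptions, not code from the survey.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, num_bits: int = 8):
    """Map real values onto a fixed, evenly spaced set of signed integers."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax            # one scale for the whole tensor (per-tensor)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original real values."""
    return q.astype(np.float32) * scale

# Storing int8 instead of float32 cuts memory 4x; 4-bit storage would cut it 8x.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_symmetric(w)
w_hat = dequantize(q, s)
print("max absolute quantization error:", np.max(np.abs(w - w_hat)))
```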
Related papers
- Constraint Guided Model Quantization of Neural Networks [0.0]
Constraint Guided Model Quantization (CGMQ) is a quantization-aware training algorithm that uses an upper bound on the available computational resources to reduce the bit-widths of the neural network's parameters.
It is shown on MNIST that the performance of CGMQ is competitive with state-of-the-art quantization-aware training algorithms.
arXiv Detail & Related papers (2024-09-30T09:41:16Z) - Low Precision Quantization-aware Training in Spiking Neural Networks
with Differentiable Quantization Function [0.5046831208137847]
This work aims to bridge the gap between recent progress in quantized neural networks and spiking neural networks.
It presents an extensive study on the performance of the quantization function, represented as a linear combination of sigmoid functions.
The presented quantization function demonstrates the state-of-the-art performance on four popular benchmarks.
arXiv Detail & Related papers (2023-05-30T09:42:05Z) - Fast Exploration of the Impact of Precision Reduction on Spiking Neural
Networks [63.614519238823206]
Spiking Neural Networks (SNNs) are a practical choice when the target hardware sits at the edge of computing.
We employ an Interval Arithmetic (IA) model to develop an exploration methodology that exploits the model's ability to propagate the approximation error.
arXiv Detail & Related papers (2022-11-22T15:08:05Z) - Low-bit Shift Network for End-to-End Spoken Language Understanding [7.851607739211987]
We propose the use of power-of-two quantization, which quantizes continuous parameters into low-bit power-of-two values.
This reduces computational complexity by replacing expensive multiplication operations with bit shifts and by using low-bit weights (a shift-based sketch of this idea appears after the list below).
arXiv Detail & Related papers (2022-07-15T14:34:22Z) - SignalNet: A Low Resolution Sinusoid Decomposition and Estimation
Network [79.04274563889548]
We propose SignalNet, a neural network architecture that detects the number of sinusoids and estimates their parameters from quantized in-phase and quadrature samples.
We introduce a worst-case learning threshold for comparing the results of our network relative to the underlying data distributions.
In simulation, we find that our algorithm is always able to surpass the threshold for three-bit data but often cannot exceed the threshold for one-bit data.
arXiv Detail & Related papers (2021-06-10T04:21:20Z) - Ps and Qs: Quantization-aware pruning for efficient low latency neural
network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z) - Accelerating Neural Network Inference by Overflow Aware Quantization [16.673051600608535]
The heavy computation inherent in deep neural networks prevents their widespread application.
We propose an overflow-aware quantization method by designing a trainable adaptive fixed-point representation.
With the proposed method, we are able to fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance.
arXiv Detail & Related papers (2020-05-27T11:56:22Z) - Integer Quantization for Deep Learning Inference: Principles and
Empirical Evaluation [4.638764944415326]
Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput.
We focus on quantization techniques that are amenable to acceleration by processors with high-throughput integer math pipelines.
We present a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied.
arXiv Detail & Related papers (2020-04-20T19:59:22Z) - Binary Neural Networks: A Survey [126.67799882857656]
The binary neural network serves as a promising technique for deploying deep models on resource-limited devices.
The binarization inevitably causes severe information loss, and even worse, its discontinuity brings difficulty to the optimization of the deep network.
We present a survey of these algorithms, categorized into solutions that directly binarize the network and optimized ones that use techniques such as minimizing the quantization error, improving the network loss function, and reducing the gradient error (a scaled-binarization sketch appears after the list below).
arXiv Detail & Related papers (2020-03-31T16:47:20Z) - Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
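The Low-bit Shift Network entry above notes that power-of-two quantization removes expensive multiplications. The sketch below is a minimal illustration of that idea rather than the paper's exact scheme: each weight is rounded to a signed power of two so that multiplying a fixed-point activation reduces to an arithmetic shift and a sign flip. The exponent range and helper names are assumptions.

```python
import numpy as np

def quantize_power_of_two(w: np.ndarray, min_exp: int = -7, max_exp: int = 0):
    """Approximate each weight as sign(w) * 2**exp with an integer exponent."""
    sign = np.sign(w)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp).astype(np.int32)
    return sign.astype(np.int32), exp

def shift_multiply(x_fixed: np.ndarray, sign: np.ndarray, exp: np.ndarray) -> np.ndarray:
    """Multiply fixed-point activations by power-of-two weights using shifts and sign flips.

    Exponents here are <= 0, so 2**exp becomes an arithmetic right shift by -exp.
    """
    return sign * (x_fixed >> (-exp))

# Example: weights near 0.5, -0.25, ... become shifts by 1, 2, ... instead of multiplications.
x = np.array([64, -32, 100, 7], dtype=np.int32)
w = np.array([0.5, -0.25, 0.9, 0.12], dtype=np.float32)
s, e = quantize_power_of_two(w)
print(shift_multiply(x, s, e))
```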
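The Binary Neural Networks survey entry above mentions minimizing the quantization error introduced by binarization. One widely used remedy, pairing binary weights with a real-valued per-filter scale as in XNOR-Net-style methods, is sketched below as a generic illustration under those assumptions, not as code from the surveyed papers.

```python
import numpy as np

def binarize_with_scale(w: np.ndarray):
    """Approximate w by alpha * sign(w), with alpha minimizing ||w - alpha*sign(w)||^2.

    For fixed signs, the optimal alpha is the mean absolute weight (per output row here).
    """
    alpha = np.mean(np.abs(w), axis=1, keepdims=True)   # one scale per row/filter
    b = np.where(w >= 0, 1.0, -1.0)                     # binary weights in {-1, +1}
    return b, alpha

# A binarized matrix-vector product reduces to additions/subtractions plus one scale per row.
w = np.random.randn(3, 5).astype(np.float32)
x = np.random.randn(5).astype(np.float32)
b, alpha = binarize_with_scale(w)
y_binary = (alpha * b) @ x
y_full = w @ x
print("approximation error per output:", np.abs(y_binary - y_full))
```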
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.