Accelerating Neural Network Inference by Overflow Aware Quantization
- URL: http://arxiv.org/abs/2005.13297v1
- Date: Wed, 27 May 2020 11:56:22 GMT
- Title: Accelerating Neural Network Inference by Overflow Aware Quantization
- Authors: Hongwei Xie, Shuo Zhang, Huanghao Ding, Yafei Song, Baitao Shao,
Conggang Hu, Ling Cai and Mingyang Li
- Abstract summary: The inherent heavy computation of deep neural networks prevents their widespread application.
We propose an overflow aware quantization method by designing a trainable adaptive fixed-point representation.
With the proposed method, we are able to fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance.
- Score: 16.673051600608535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The inherent heavy computation of deep neural networks prevents their
widespread application. A widely used method for accelerating model inference
is quantization, which replaces the input operands of a network with fixed-point
values. The majority of the computation cost then lies in integer matrix
multiply-accumulate operations. In fact, a high-bit accumulator leads to partially
wasted computation, while a low-bit one typically suffers from numerical overflow.
To address this problem, we propose an overflow aware quantization method that
designs a trainable adaptive fixed-point representation to optimize the number
of bits for each input tensor while prohibiting numeric overflow during the
computation. With the proposed method, we are able to fully utilize the
computing power to minimize the quantization loss and obtain optimized
inference performance. To verify the effectiveness of our method, we conduct
image classification, object detection, and semantic segmentation tasks on
ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results
demonstrate that the proposed method can achieve comparable performance with
state-of-the-art quantization methods while accelerating the inference process
by about 2 times.
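The trade-off described above (a wide accumulator wastes compute, a narrow one overflows) can be pictured with a minimal NumPy sketch that picks the largest operand bit-width for which the worst-case accumulation still fits a 16-bit accumulator. The helper names and the static worst-case bound are illustrative assumptions; the paper's method instead learns an adaptive fixed-point representation during training.

```python
import numpy as np

def max_safe_bits(k, acc_bits=16):
    # Largest signed bit-width n such that accumulating k products of two n-bit
    # signed integers can never overflow a signed acc_bits accumulator:
    # worst-case |sum| <= k * 2^(2n-2) must stay below 2^(acc_bits-1).
    n = 2
    while k * 2 ** (2 * (n + 1) - 2) < 2 ** (acc_bits - 1):
        n += 1
    return n

def quantize(x, bits):
    # Symmetric uniform quantization to signed integers with a per-tensor scale.
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.max(np.abs(x))), 1e-8) / qmax
    return np.round(x / scale).astype(np.int64), scale

# Toy layer y = x @ w computed with integer MACs whose bit-width is chosen so a
# 16-bit accumulator cannot overflow (the inner dimension k = 64 sets the budget).
rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 64)), rng.standard_normal((64, 8))
bits = max_safe_bits(k=64, acc_bits=16)
qx, sx = quantize(x, bits)
qw, sw = quantize(w, bits)
y = (qx @ qw) * (sx * sw)                      # integer multiply-accumulate, then rescale
print(bits, np.abs(y - x @ w).max())           # chosen bit-width and quantization error
```

The bound is conservative: with symmetric quantization the operands never reach the most negative code, so the computed bit-width is always safe for the given inner dimension.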
Related papers
- Towards Efficient Verification of Quantized Neural Networks [9.352320240912109]
Quantization replaces floating-point arithmetic with integer arithmetic in deep neural network models.
We show how verification efficiency can be improved by utilizing gradient-based search methods and bound-propagation techniques (a minimal interval-propagation sketch follows this entry).
arXiv Detail & Related papers (2023-12-20T00:43:13Z)
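As a side note on the bound-propagation technique mentioned in this entry, the sketch below pushes an input box through a single affine layer with integer weights using interval arithmetic; it is a generic illustration under assumed names, not the verification procedure of that paper.

```python
import numpy as np

def interval_affine(W, b, lo, hi):
    # Propagate elementwise bounds lo <= x <= hi through y = W @ x + b.
    # Positive weights preserve the ordering of bounds; negative weights swap it.
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

# Integer weights (as after quantization) and an input box around a sample.
W = np.array([[2, -1], [0, 3]])
b = np.array([1, -2])
lo, hi = np.array([0.0, 0.5]), np.array([0.2, 0.7])
print(interval_affine(W, b, lo, hi))  # sound lower/upper bounds on the layer output
```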
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we build an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z)
- Quantized Proximal Averaging Network for Analysis Sparse Coding [23.080395291046408]
We unfold an iterative algorithm into a trainable network that facilitates learning sparsity prior to quantization (an unrolling sketch follows this entry).
We demonstrate applications to compressed image recovery and magnetic resonance image reconstruction.
arXiv Detail & Related papers (2021-05-13T12:05:35Z)
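The unfolding idea in this entry can be pictured with a minimal unrolled ISTA loop in which each iteration acts as a layer with its own soft-threshold (the candidate trainable parameter); the names `A`, `y`, `thresholds`, and `step` are assumptions for illustration, not that paper's architecture.

```python
import numpy as np

def unrolled_ista(A, y, thresholds, step):
    # Unfold ISTA into a feed-forward pass: each entry of `thresholds` acts as a
    # per-layer (trainable, in the learned version) soft-threshold.
    x = np.zeros(A.shape[1])
    for t in thresholds:
        r = x - step * A.T @ (A @ x - y)                  # gradient step on 0.5*||Ax - y||^2
        x = np.sign(r) * np.maximum(np.abs(r) - t, 0.0)   # soft-thresholding (sparsity prior)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 30)) / np.sqrt(100)         # roughly unit-norm columns
y = A[:, 0]                                               # observation generated by a 1-sparse code
x_hat = unrolled_ista(A, y, thresholds=[0.05] * 10, step=0.3)
print(np.argmax(np.abs(x_hat)))                           # typically recovers index 0
```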
- A Survey of Quantization Methods for Efficient Neural Network Inference [75.55159744950859]
Quantization is the problem of mapping continuous real-valued numbers onto a fixed discrete set of values so as to minimize the number of bits required (a minimal uniform quantize/dequantize sketch follows this entry).
It has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas.
Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x.
arXiv Detail & Related papers (2021-03-25T06:57:11Z)
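For reference, here is a minimal sketch of the uniform (affine) quantize/dequantize mapping that the survey's definition describes; the scale and zero-point conventions below are the common ones, not specific to any single paper.

```python
import numpy as np

def quantize_uniform(x, bits=4):
    # Affine uniform quantization: map real values in [x.min(), x.max()]
    # onto the integer grid {0, ..., 2^bits - 1} via a scale and zero-point.
    qmin, qmax = 0, 2 ** bits - 1
    scale = max(float(x.max() - x.min()), 1e-8) / (qmax - qmin)
    zero_point = int(round(qmin - float(x.min()) / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.default_rng(0).standard_normal(8).astype(np.float32)
q, s, z = quantize_uniform(x, bits=4)
print(x, dequantize(q, s, z))  # reconstruction error is bounded by roughly scale / 2
```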
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
- DAQ: Distribution-Aware Quantization for Deep Image Super-Resolution Networks [49.191062785007006]
Quantizing deep convolutional neural networks for image super-resolution substantially reduces their computational costs.
Existing works either suffer from a severe performance drop at ultra-low precisions of 4 bits or fewer, or require a heavy fine-tuning process to recover performance.
We propose a novel distribution-aware quantization scheme (DAQ) which facilitates accurate training-free quantization in ultra-low precision.
arXiv Detail & Related papers (2020-12-21T10:19:42Z)
- WrapNet: Neural Net Inference with Ultra-Low-Resolution Arithmetic [57.07483440807549]
We propose a method that adapts neural networks to use low-resolution (8-bit) additions in the accumulators, achieving classification accuracy comparable to their 32-bit counterparts.
We demonstrate the efficacy of our approach on both software and hardware platforms (a toy wraparound-accumulation example follows this entry).
arXiv Detail & Related papers (2020-07-26T23:18:38Z)
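To see why low-resolution accumulation is delicate (the same failure mode the main paper guards against), the toy below compares an exact wide accumulation with an 8-bit accumulator that wraps around on overflow; it illustrates generic two's-complement wraparound, not WrapNet's specific adaptation mechanism.

```python
import numpy as np

# Dot product of small integer operands: a 32-bit accumulator is exact, while an
# 8-bit accumulator wraps around (two's complement) once the running sum leaves
# [-128, 127] -- exactly the overflow that low-bit MAC units must handle.
rng = np.random.default_rng(0)
a = rng.integers(-8, 8, size=64)
b = rng.integers(-8, 8, size=64)

acc32 = int(np.sum(a * b))            # wide accumulator: exact result
acc8 = 0
for ai, bi in zip(a, b):
    acc8 = (acc8 + int(ai) * int(bi) + 128) % 256 - 128  # emulate 8-bit wraparound
print(acc32, acc8)                    # typically disagree once |sum| exceeds 127
```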
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to fully eliminate floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- Dithered backprop: A sparse and quantized backpropagation algorithm for more efficient deep neural network training [18.27946970159625]
We propose a method for reducing the computational cost of backprop, which we name dithered backprop.
We show that our method is fully compatible with state-of-the-art training methods that reduce the bit-precision of training down to 8 bits (a stochastic-rounding sketch follows this entry).
arXiv Detail & Related papers (2020-04-09T17:59:26Z)
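The dithering idea can be pictured with stochastic rounding: uniform noise added before truncation makes the quantizer unbiased, so its error behaves like zero-mean noise rather than a systematic bias. This is a generic sketch of dithered quantization, not the exact scheme of that paper.

```python
import numpy as np

def dithered_round(x, step, rng):
    # Quantize x to multiples of `step`, adding uniform dither in [0, 1) before
    # flooring so that E[output] == x (unbiased, noise-like quantization error).
    return step * np.floor(x / step + rng.random(x.shape))

rng = np.random.default_rng(0)
g = np.full(100_000, 0.3)                     # a constant "gradient" value
q = dithered_round(g, step=1.0, rng=rng)
print(q.mean())                               # ~0.3, although each sample is 0.0 or 1.0
```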
- Post-Training Piecewise Linear Quantization for Deep Neural Networks [13.717228230596167]
Quantization plays an important role in the energy-efficient deployment of deep neural networks on resource-limited devices.
We propose a piecewise linear quantization scheme to enable accurate approximation of tensor values that have bell-shaped distributions with long tails (a two-region sketch follows this entry).
Compared to state-of-the-art post-training quantization methods, our proposed method achieves superior performance on image classification, semantic segmentation, and object detection with minor overhead.
arXiv Detail & Related papers (2020-01-31T23:47:00Z)
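The piecewise idea in the last entry, spending quantization resolution differently on the dense center and the long tails of a bell-shaped distribution, can be sketched with a two-region quantizer; the breakpoint `p` and the per-region level budget below are illustrative simplifications, not the paper's exact scheme.

```python
import numpy as np

def piecewise_linear_quantize(x, bits=4, p=1.0):
    # Two-region quantizer: the dense center [-p, p] and the sparse tails
    # (p, max|x|] each get their own uniform step, so the center is resolved
    # more finely than a single uniform grid over the whole range would allow.
    levels = 2 ** (bits - 1)                   # levels per region and sign (illustrative budget)
    tail_max = float(np.max(np.abs(x)))
    center_step = p / levels
    tail_step = max(tail_max - p, 1e-8) / levels
    ax, sign = np.abs(x), np.sign(x)
    center = np.round(np.clip(ax, 0.0, p) / center_step) * center_step
    tail = p + np.round(np.clip(ax - p, 0.0, None) / tail_step) * tail_step
    return sign * np.where(ax <= p, center, tail)

x = np.random.default_rng(0).standard_normal(10_000)      # bell-shaped value distribution
xq = piecewise_linear_quantize(x, bits=4, p=1.0)
print(float(np.mean((x - xq) ** 2)))                       # overall mean-squared quantization error
```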