ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural
Network Quantization
- URL: http://arxiv.org/abs/2208.14286v1
- Date: Tue, 30 Aug 2022 14:12:49 GMT
- Title: ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural
Network Quantization
- Authors: Cong Guo, Chen Zhang, Jingwen Leng, Zihan Liu, Fan Yang, Yunxin Liu,
Minyi Guo, Yuhao Zhu
- Abstract summary: We propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads.
Our design results in 2.8$\times$ speedup and 2.5$\times$ energy efficiency improvement over the state-of-the-art quantization accelerators.
- Score: 31.494669469303954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantization is a technique to reduce the computation and memory cost of DNN
models, which are getting increasingly large. Existing quantization solutions
use fixed-point integer or floating-point types, which have limited benefits,
as both require more bits to maintain the accuracy of original models. On the
other hand, variable-length quantization uses low-bit quantization for normal
values and high-precision for a fraction of outlier values. Even though this
line of work brings algorithmic benefits, it also introduces significant
hardware overheads due to variable-length encoding and decoding.
In this work, we propose a fixed-length adaptive numerical data type called
ANT to achieve low-bit quantization with tiny hardware overheads. Our data type
ANT leverages two key innovations to exploit the intra-tensor and inter-tensor
adaptive opportunities in DNN models. First, we propose a particular data type,
flint, that combines the advantages of float and int for adapting to the
importance of different values within a tensor. Second, we propose an adaptive
framework that selects the best type for each tensor according to its
distribution characteristics. We design a unified processing element
architecture for ANT and show its ease of integration with existing DNN
accelerators. Our design results in 2.8$\times$ speedup and 2.5$\times$ energy
efficiency improvement over the state-of-the-art quantization accelerators.
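The abstract does not spell out flint's bit-level layout, so the sketch below only illustrates the per-tensor adaptation idea under assumed details: two hypothetical 4-bit candidate quantizers (a uniform int-style one and a toy float-style one with a power-of-two exponent) are applied to a tensor, and the candidate with the lower mean-squared error is kept. The function names, the MSE criterion, and the toy float format are illustrative assumptions, not ANT's actual design.

```python
import numpy as np

def quantize_int(x, bits=4):
    """Hypothetical int-style candidate: symmetric uniform quantization, per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.max(np.abs(x)), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quantize_floatlike(x, bits=4, exp_bits=2):
    """Hypothetical float-style candidate: power-of-two exponent plus a short mantissa.
    A toy stand-in for a float-like type, not the flint format from the paper."""
    man_bits = bits - 1 - exp_bits              # one bit reserved for the sign
    max_val = max(np.max(np.abs(x)), 1e-12)
    scaled = np.abs(x) / max_val                # normalize magnitudes into [0, 1]
    exp = np.clip(np.floor(np.log2(scaled + 1e-12)), -(2 ** exp_bits - 1), 0)
    step = 2.0 ** exp / (2 ** man_bits)         # finer steps for smaller magnitudes
    return np.sign(x) * np.round(scaled / step) * step * max_val

def select_tensor_type(x, bits=4):
    """Pick the candidate type with the lowest mean-squared quantization error."""
    candidates = {
        "int": quantize_int(x, bits),
        "float-like": quantize_floatlike(x, bits),
    }
    errors = {name: float(np.mean((x - xq) ** 2)) for name, xq in candidates.items()}
    best = min(errors, key=errors.get)
    return best, candidates[best]

# Roughly Gaussian tensors tend to favor the int candidate; heavy-tailed tensors
# with outliers tend to favor the float-like candidate.
weights = np.random.standard_t(df=3, size=4096).astype(np.float32)
chosen, quantized = select_tensor_type(weights)
print(chosen, float(np.mean((weights - quantized) ** 2)))
```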
Related papers
- Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference [4.093167352780157]
We introduce Logarithmic Posits (LP), an adaptive, hardware-friendly data type inspired by posits.
We also develop a novel genetic-algorithm based framework, LP Quantization (LPQ), to find optimal layer-wise LP parameters.
arXiv Detail & Related papers (2024-03-08T17:28:49Z)
- Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
arXiv Detail & Related papers (2023-07-27T16:30:27Z)
- Recurrent Bilinear Optimization for Binary Neural Networks [58.972212365275595]
Existing BNNs neglect the intrinsic bilinear relationship between real-valued weights and scale factors.
Our work is the first attempt to optimize BNNs from the bilinear perspective.
We obtain robust RBONNs, which show impressive performance over state-of-the-art BNNs on various models and datasets.
arXiv Detail & Related papers (2022-09-04T06:45:33Z)
- Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks [1.131071436917293]
Quantizing parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference.
This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing.
arXiv Detail & Related papers (2022-06-15T18:11:37Z)
- A Comprehensive Survey on Model Quantization for Deep Neural Networks in Image Classification [0.0]
A promising approach is quantization, in which the full-precision values are stored in low bit-width precision.
We present a comprehensive survey of quantization concepts and methods, with a focus on image classification.
We explain the replacement of floating-point operations with low-cost bitwise operations in a quantized DNN and the sensitivity of different layers in quantization.
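As a concrete illustration of the "low-cost bitwise operations" mentioned above (not taken from the survey itself): when both weights and activations are binarized to {-1, +1}, a dot product reduces to an XOR/XNOR of packed bits followed by a population count. A minimal sketch under that assumption:

```python
def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors packed as bitmasks (bit i set = element i is +1).
    mismatches = popcount(a XOR w); dot = (n - mismatches) - mismatches = n - 2 * mismatches."""
    mismatches = bin((a_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mismatches

# a = [+1, -1, +1, +1] and w = [+1, +1, -1, +1] give dot = 1 - 1 - 1 + 1 = 0
a = 0b1101  # least-significant bit is element 0
w = 0b1011
print(binary_dot(a, w, 4))  # -> 0
```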
arXiv Detail & Related papers (2022-05-14T15:08:32Z)
- REx: Data-Free Residual Quantization Error Expansion [32.87131159997359]
Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost.
With growing concerns over privacy rights, we focus our efforts on data-free methods.
We propose REx, a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization.
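A rough sketch of the residual-expansion idea summarized above (the group-sparsity and ensemble parts are omitted, and the uniform quantizer below is a generic placeholder, not necessarily the one REx uses): the weight tensor is quantized, the remaining error is itself quantized, and the dequantized terms are summed.

```python
import numpy as np

def uniform_quant(x, bits=4):
    """Generic symmetric uniform quantizer, used here only for illustration."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.max(np.abs(x)), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def residual_expansion(w, bits=4, orders=3):
    """Approximate w as a sum of quantized terms w ~ q_1 + q_2 + ... + q_k,
    where each q_i quantizes the residual left by the previous terms."""
    terms, residual = [], w
    for _ in range(orders):
        q = uniform_quant(residual, bits)
        terms.append(q)
        residual = residual - q
    return terms

w = np.random.randn(1024).astype(np.float32)
approx = sum(residual_expansion(w))
print(float(np.mean((w - approx) ** 2)))  # error shrinks as more terms are added
```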
arXiv Detail & Related papers (2022-03-28T11:04:45Z)
- ECQ$^{\text{x}}$: Explainability-Driven Quantization for Low-Bit and Sparse DNNs [13.446502051609036]
We develop and describe a novel quantization paradigm for deep neural networks (DNNs).
Our method leverages concepts of explainable AI (XAI) and concepts of information theory.
The ultimate goal is to preserve the most relevant weights in quantization clusters of highest information content.
arXiv Detail & Related papers (2021-09-09T12:57:06Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
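The general idea behind such a decomposition can be sketched as follows (the paper's exact scheme may differ): quantize weights to a symmetric odd-integer grid, then rewrite each quantized value as a power-of-two weighted sum of {-1, +1} branches, so every branch is a pure binary tensor.

```python
import numpy as np

def odd_level_quantize(w, bits=2):
    """Quantize to the symmetric odd-integer grid {-(2^k-1), ..., -1, +1, ..., 2^k-1},
    a common grid for multi-bit binary decompositions (an illustrative assumption)."""
    levels = 2 ** bits - 1                        # e.g. 3 for 2 bits
    scale = max(np.max(np.abs(w)), 1e-12) / levels
    q = np.clip(np.round((w / scale - 1) / 2) * 2 + 1, -levels, levels)
    return q.astype(np.int64), scale

def to_binary_branches(q, bits=2):
    """Exactly decompose odd integers q into {-1, +1} branches: q = sum_i 2^i * B_i."""
    m = (q + (2 ** bits - 1)) // 2                # map odd grid onto 0 .. 2^bits - 1
    return [2 * ((m >> i) & 1) - 1 for i in range(bits)]

w = np.random.randn(8).astype(np.float32)
q, scale = odd_level_quantize(w, bits=2)
branches = to_binary_branches(q, bits=2)
recon = sum((2 ** i) * b for i, b in enumerate(branches))
assert np.array_equal(recon, q)                   # dequantized weights are scale * recon
```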
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of full-precision networks.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)