DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural
Network Inference
- URL: http://arxiv.org/abs/2302.12510v1
- Date: Fri, 24 Feb 2023 08:46:01 GMT
- Title: DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural
Network Inference
- Authors: Jiajun Zhou, Jiajun Wu, Yizhao Gao, Yuhao Ding, Chaofan Tao, Boyu Li,
Fengbin Tu, Kwang-Ting Cheng, Hayden Kwok-Hay So and Ngai Wong
- Abstract summary: This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy against speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
- Score: 28.912023025671868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To accelerate the inference of deep neural networks (DNNs), quantization with
low-bitwidth numbers is actively researched. A prominent challenge is to
quantize the DNN models into low-bitwidth numbers without significant accuracy
degradation, especially at very low bitwidths (< 8 bits). This work targets an
adaptive data representation with variable-length encoding called DyBit. DyBit
can dynamically adjust the precision and range of its separate bit-fields to
adapt to the distribution of DNN weights and activations. We also propose a
hardware-aware quantization framework with a mixed-precision accelerator to
trade off inference accuracy against speedup. Experimental results demonstrate
that the inference accuracy via DyBit is 1.997% higher than the
state-of-the-art at 4-bit quantization, and the proposed framework can achieve
up to 8.1x speedup compared with the original model.
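The abstract does not spell out the exact DyBit encoding, but the core idea of spending a fixed bit budget differently on range versus precision for each tensor can be illustrated with a small sketch. Everything below (function names, the fixed-point interpretation of the bit-field split, the MSE selection criterion) is an assumption for illustration, not the paper's actual format:

```python
import numpy as np

def quantize_fixed_point(x, total_bits, frac_bits):
    """Symmetric signed fixed-point quantization with frac_bits fractional bits."""
    scale = 2.0 ** frac_bits
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), -qmax - 1, qmax)
    return q / scale

def dybit_like_quantize(x, total_bits=4):
    """Illustrative only: pick the split between range (integer bits) and
    precision (fractional bits) that minimizes MSE for this tensor."""
    best_split, best_xq, best_err = None, None, np.inf
    for frac_bits in range(total_bits):          # candidate bit-field splits
        xq = quantize_fixed_point(x, total_bits, frac_bits)
        err = float(np.mean((x - xq) ** 2))
        if err < best_err:
            best_split, best_xq, best_err = frac_bits, xq, err
    return best_split, best_xq

# Example: a roughly Gaussian weight tensor
w = np.random.randn(1024).astype(np.float32) * 0.5
frac_bits, w_q = dybit_like_quantize(w, total_bits=4)
print(f"chosen fractional bits: {frac_bits}, MSE: {np.mean((w - w_q) ** 2):.2e}")
```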
Related papers
- FxP-QNet: A Post-Training Quantizer for the Design of Mixed
Low-Precision DNNs with Dynamic Fixed-Point Representation [2.4149105714758545]
We propose a novel framework referred to as the fixed-point quantizer of deep neural networks (FxP-QNet).
FxP-QNet adapts the quantization level for each data-structure of each layer based on the trade-off between the network accuracy and the low-precision requirements.
Results show that FxP-QNet-quantized AlexNet, VGG-16, and ResNet-18 reduce the overall memory requirements of their full-precision counterparts by 7.16x, 10.36x, and 6.44x with less than 0.95%, 0.95%, and 1.99% accuracy degradation, respectively.
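The summary does not give FxP-QNet's actual search procedure; purely as a loose illustration of post-training per-layer bitwidth selection, the sketch below greedily lowers the precision of whichever layer costs least under a simple quantization-MSE proxy (the layer names, bitwidth candidates, and proxy are all assumptions; the real method trades off measured accuracy against low-precision requirements):

```python
import numpy as np

# Toy stand-in: four weight tensors and a quantization-MSE proxy for accuracy.
rng = np.random.default_rng(0)
layers = {f"layer{i}": rng.normal(0, 0.1, size=4096) for i in range(4)}

def quant_mse(w, bits):
    """MSE of symmetric uniform quantization at a given bitwidth."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return float(np.mean((w - np.round(w / scale) * scale) ** 2))

def greedy_bit_allocation(layers, budget_bits, candidates=(8, 6, 4, 2)):
    """Greedily drop whichever layer's bitwidth hurts the proxy least until
    the total model size (in bits) fits the budget."""
    assign = {name: candidates[0] for name in layers}
    total = lambda: sum(w.size * assign[n] for n, w in layers.items())
    while total() > budget_bits:
        best, best_cost = None, np.inf
        for name, w in layers.items():
            i = candidates.index(assign[name])
            if i + 1 == len(candidates):
                continue                      # already at the lowest precision
            cost = quant_mse(w, candidates[i + 1]) - quant_mse(w, assign[name])
            if cost < best_cost:
                best, best_cost = name, cost
        if best is None:
            break                             # nothing left to shrink
        assign[best] = candidates[candidates.index(assign[best]) + 1]
    return assign

print(greedy_bit_allocation(layers, budget_bits=5 * 4096 * 4))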
arXiv Detail & Related papers (2022-03-22T23:01:43Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming problem.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
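The exact decomposition and acceleration scheme are not detailed in this summary; the sketch below only verifies the basic identity behind such {-1, +1} decompositions, namely that an unsigned m-bit weight equals a weighted sum of {-1, +1} branches plus a constant offset, so an inner product splits into binary inner products. The function name and toy sizes are illustrative assumptions:

```python
import numpy as np

def pm1_branches(q, bits):
    """Decompose unsigned integer weights q in [0, 2^bits) into `bits`
    branches with entries in {-1, +1} plus a constant offset, so that
    q = sum_i 2^(i-1) * s_i + offset."""
    branches = [2 * ((q >> i) & 1) - 1 for i in range(bits)]   # {0,1} -> {-1,+1}
    offset = (2 ** bits - 1) / 2.0
    return branches, offset

rng = np.random.default_rng(1)
q = rng.integers(0, 4, size=256)        # 2-bit unsigned weights
x = rng.normal(size=256)                # activations

branches, offset = pm1_branches(q, bits=2)
dot = sum(2.0 ** (i - 1) * (x @ s) for i, s in enumerate(branches))
dot += offset * x.sum()
assert np.isclose(dot, x @ q)           # matches the direct inner product
```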
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
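One way to avoid the float conversion cost, in the spirit of HAWQV3's dyadic arithmetic (the paper's actual derivation and rounding handling differ), is to approximate each requantization scale by a dyadic number m / 2^k, so rescaling int32 accumulators needs only an integer multiply and a shift. A minimal sketch with illustrative values:

```python
import numpy as np

def dyadic_approx(scale, shift_bits=16):
    """Approximate a real-valued rescaling factor by m / 2^shift_bits."""
    return int(round(scale * (1 << shift_bits))), shift_bits

def requantize_int_only(acc_int32, scale, shift_bits=16):
    """Rescale int32 accumulators to int8 with an integer multiply + shift."""
    m, k = dyadic_approx(scale, shift_bits)
    rounded = (acc_int32.astype(np.int64) * m + (1 << (k - 1))) >> k
    return np.clip(rounded, -128, 127).astype(np.int8)

acc = np.array([12345, -6789, 400, -3], dtype=np.int32)   # int32 accumulators
scale = 0.0037                                            # e.g. s_x * s_w / s_y
print(requantize_int_only(acc, scale))                    # integer-only path
print(np.clip(np.round(acc * scale), -128, 127))          # float reference
```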
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Subtensor Quantization for Mobilenets [5.735035463793008]
Quantization of deep neural networks (DNNs) has enabled developers to deploy models with less memory and more efficient low-power inference.
In this paper, we analyzed several root causes of quantization loss and proposed alternatives that do not rely on per-channel or training-aware approaches.
We evaluate the image classification task on the ImageNet dataset, and our post-training quantized 8-bit inference top-1 accuracy is within 0.7% of the floating-point version.
arXiv Detail & Related papers (2020-11-04T15:41:47Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differentiable method to search them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- Fast Implementation of 4-bit Convolutional Neural Networks for Mobile
Devices [0.8362190332905524]
We show an efficient implementation of 4-bit matrix multiplication for quantized neural networks.
We also demonstrate a 4-bit quantized neural network for OCR recognition on the MIDV-500 dataset.
The results show that 4-bit quantization is well suited to mobile devices, yielding acceptable accuracy and low inference time.
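The paper targets optimized mobile kernels, which this summary does not describe; purely as an assumed illustration of the storage side of 4-bit inference, the sketch below packs two signed 4-bit values per byte, unpacks them, and accumulates the matrix product in int32 (the helper names are hypothetical):

```python
import numpy as np

def pack_int4(w):
    """Pack signed 4-bit values (-8..7) two per byte (low nibble first)."""
    assert w.size % 2 == 0 and w.min() >= -8 and w.max() <= 7
    nib = (w.astype(np.uint8) & 0x0F).reshape(-1, 2)   # two's-complement nibbles
    return (nib[:, 0] | (nib[:, 1] << 4)).astype(np.uint8)

def unpack_int4(packed, n):
    """Recover n signed 4-bit values from pack_int4's output."""
    u = np.stack([packed & 0x0F, (packed >> 4) & 0x0F], axis=1).reshape(-1)[:n]
    u = u.astype(np.int8)
    return np.where(u >= 8, u - 16, u)                 # sign-extend each nibble

rng = np.random.default_rng(2)
w = rng.integers(-8, 8, size=(32, 64), dtype=np.int8)  # 4-bit weights
x = rng.integers(-8, 8, size=(16, 64), dtype=np.int8)  # 4-bit activations

packed = pack_int4(w.reshape(-1))                      # half the bytes of int8
w_back = unpack_int4(packed, w.size).reshape(32, 64)
y = x.astype(np.int32) @ w_back.T.astype(np.int32)     # accumulate in int32
assert np.array_equal(w_back, w) and y.shape == (16, 32)
```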
arXiv Detail & Related papers (2020-09-14T14:48:40Z)
- FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, the computational complexity of the ternary inner product can be reduced by a factor of 2.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z)
- Quantized Neural Network Inference with Precision Batching [4.519884877213097]
PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations.
Across a variety of applications, PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a 1% error margin of the full-precision baseline.
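The kernel details are not given in this summary; the following sketch only demonstrates the bitlayer identity such accumulation relies on: splitting unsigned quantized activations into {0, 1} bit-planes turns a dot product into a power-of-two-weighted sum of 1-bit dot products (checked here in NumPy rather than with real 1-bit hardware ops, and the exact tensors the paper decomposes may differ):

```python
import numpy as np

def bitlayers(q, bits):
    """Split unsigned integer activations into `bits` binary {0, 1} layers."""
    return [(q >> i) & 1 for i in range(bits)]

rng = np.random.default_rng(3)
q_act = rng.integers(0, 16, size=512)          # 4-bit unsigned activations
w = rng.normal(size=512)                       # higher-precision weights

# One 1-bit dot product per bitlayer, scaled by its place value.
acc = sum((1 << i) * (w @ layer) for i, layer in enumerate(bitlayers(q_act, 4)))
assert np.isclose(acc, w @ q_act)              # equals the direct dot product
```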
arXiv Detail & Related papers (2020-02-26T19:34:11Z)
- Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantized neural networks (QNNs) are very attractive to industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in the original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)
- Shifted and Squeezed 8-bit Floating Point format for Low-Precision
Training of Deep Neural Networks [13.929168096016957]
We introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers.
Reduced bit precision allows for a larger effective memory and increased computational speed.
We show that, unlike previous 8-bit precision training methods, the proposed method works out-of-the-box for representative models.
arXiv Detail & Related papers (2020-01-16T06:38:27Z)