FxP-QNet: A Post-Training Quantizer for the Design of Mixed
Low-Precision DNNs with Dynamic Fixed-Point Representation
- URL: http://arxiv.org/abs/2203.12091v1
- Date: Tue, 22 Mar 2022 23:01:43 GMT
- Title: FxP-QNet: A Post-Training Quantizer for the Design of Mixed
Low-Precision DNNs with Dynamic Fixed-Point Representation
- Authors: Ahmad Shawahna, Sadiq M. Sait, Aiman El-Maleh, and Irfan Ahmad
- Abstract summary: We propose a novel framework referred to as the Fixed-Point Quantizer of deep neural Networks (FxP-QNet).
FxP-QNet adapts the quantization level for each data-structure of each layer based on the trade-off between the network accuracy and the low-precision requirements.
Results show that FxP-QNet-quantized AlexNet, VGG-16, and ResNet-18 reduce the overall memory requirements of their full-precision counterparts by 7.16x, 10.36x, and 6.44x with less than 0.95%, 0.95%, and 1.99% accuracy drop, respectively.
- Score: 2.4149105714758545
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks (DNNs) have demonstrated their effectiveness in a wide
range of computer vision tasks, with the state-of-the-art results obtained
through complex and deep structures that require intensive computation and
memory. Nowadays, efficient model inference is crucial for consumer
applications on resource-constrained platforms. As a result, there is much
interest in the research and development of dedicated deep learning (DL)
hardware to improve the throughput and energy efficiency of DNNs. Low-precision
representation of DNN data-structures through quantization would bring great
benefits to specialized DL hardware. However, aggressive quantization leads
to a severe accuracy drop. Quantization therefore opens a large hyper-parameter
space of bit-precision levels, whose exploration is a major challenge.
In this paper, we propose a novel framework referred to as the Fixed-Point
Quantizer of deep neural Networks (FxP-QNet) that flexibly designs a mixed
low-precision DNN for integer-arithmetic-only deployment. Specifically, the
FxP-QNet gradually adapts the quantization level for each data-structure of
each layer based on the trade-off between the network accuracy and the
low-precision requirements. Additionally, it employs post-training
self-distillation and network prediction error statistics to optimize the
quantization of floating-point values into fixed-point numbers. Examining
FxP-QNet on state-of-the-art architectures and the benchmark ImageNet dataset,
we empirically demonstrate the effectiveness of FxP-QNet in achieving the
accuracy-compression trade-off without the need for training. The results show
that FxP-QNet-quantized AlexNet, VGG-16, and ResNet-18 reduce the overall
memory requirements of their full-precision counterparts by 7.16x, 10.36x, and
6.44x with less than 0.95%, 0.95%, and 1.99% accuracy drop, respectively.
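
As a rough illustration of the kind of post-training, dynamic fixed-point quantization the abstract describes, the sketch below converts a floating-point tensor into signed fixed-point integers whose fractional length (shared exponent) is chosen per tensor from its data range. The function names, the range-based heuristic, and the 8-bit example are illustrative assumptions, not the authors' actual FxP-QNet procedure, which additionally uses self-distillation and prediction-error statistics to pick per-layer bit-widths.

```python
import numpy as np

def quantize_dynamic_fixed_point(x: np.ndarray, word_length: int):
    """Quantize `x` to signed dynamic fixed-point with `word_length` bits.

    Illustrative heuristic: the fractional length is picked per tensor so the
    largest magnitude fits the signed integer range. This is an assumption for
    the sketch, not the FxP-QNet bit-allocation rule.
    """
    qmin, qmax = -(2 ** (word_length - 1)), 2 ** (word_length - 1) - 1
    max_abs = float(np.max(np.abs(x))) + 1e-12        # avoid log2(0)
    int_length = int(np.ceil(np.log2(max_abs)))       # bits needed for the integer part
    frac_length = word_length - 1 - int_length        # remaining bits for the fraction
    scale = 2.0 ** frac_length
    q = np.clip(np.round(x * scale), qmin, qmax).astype(np.int32)
    return q, frac_length                             # integers + shared exponent

def dequantize(q: np.ndarray, frac_length: int) -> np.ndarray:
    """Map fixed-point integers back to floating point for error measurement."""
    return q.astype(np.float32) / (2.0 ** frac_length)

# Example: 8-bit quantization of a random weight-like tensor.
w = (0.1 * np.random.randn(64, 32)).astype(np.float32)
q, fl = quantize_dynamic_fixed_point(w, word_length=8)
err = np.max(np.abs(w - dequantize(q, fl)))
print(f"fractional length = {fl}, max abs error = {err:.6f}")
```

In FxP-QNet itself, the fixed 8-bit choice used here would be replaced by a per-layer, per-data-structure search over word lengths, gradually lowering precision as long as the accuracy degradation stays acceptable.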
Related papers
- On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z) - Design of High-Throughput Mixed-Precision CNN Accelerators on FPGA [0.0]
Layer-wise mixed-precision quantization allows for more efficient results but inflates the design space.
We present an in-depth quantitative methodology to efficiently explore the design space considering the limited hardware resources of a given FPGA.
Our resulting hardware accelerators implement truly mixed-precision operations that enable efficient execution of layer-wise and channel-wise quantized CNNs.
arXiv Detail & Related papers (2022-08-09T15:32:51Z) - Green, Quantized Federated Learning over Wireless Networks: An
Energy-Efficient Design [68.86220939532373]
The finite precision level is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format.
The proposed FL framework can reduce energy consumption until convergence by up to 70% compared to a baseline FL algorithm.
arXiv Detail & Related papers (2022-07-19T16:37:24Z) - Low-bit Shift Network for End-to-End Spoken Language Understanding [7.851607739211987]
We propose the use of power-of-two quantization, which quantizes continuous parameters into low-bit power-of-two values.
This reduces computational complexity by removing expensive multiplication operations and by using low-bit weights.
arXiv Detail & Related papers (2022-07-15T14:34:22Z) - Edge Inference with Fully Differentiable Quantized Mixed Precision
Neural Networks [1.131071436917293]
Quantizing parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference.
This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing.
arXiv Detail & Related papers (2022-06-15T18:11:37Z) - On the Tradeoff between Energy, Precision, and Accuracy in Federated
Quantized Neural Networks [68.52621234990728]
Federated learning (FL) over wireless networks requires balancing between accuracy, energy efficiency, and precision.
We propose a quantized FL framework that represents data with a finite level of precision in both local training and uplink transmission.
Our framework can reduce energy consumption by up to 53% compared to a standard FL model.
arXiv Detail & Related papers (2021-11-15T17:00:03Z) - Subtensor Quantization for Mobilenets [5.735035463793008]
Quantization for deep neural networks (DNNs) has enabled developers to deploy models with less memory and more efficient low-power inference.
In this paper, we analyzed several root causes of quantization loss and proposed alternatives that do not rely on per-channel or training-aware approaches.
We evaluate the image classification task on the ImageNet dataset, and the top-1 accuracy of our post-training quantized 8-bit inference is within 0.7% of the floating-point version.
arXiv Detail & Related papers (2020-11-04T15:41:47Z) - AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z) - APQ: Joint Search for Network Architecture, Pruning and Quantization
Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z) - Widening and Squeezing: Towards Accurate and Efficient QNNs [125.172220129257]
Quantization neural networks (QNNs) are very attractive to the industry because of their extremely cheap computation and storage overhead, but their performance is still worse than that of networks with full-precision parameters.
Most existing methods aim to enhance the performance of QNNs, especially binary neural networks, by exploiting more effective training techniques.
We address this problem by projecting features in original full-precision networks to high-dimensional quantization features.
arXiv Detail & Related papers (2020-02-03T04:11:13Z)