Development of Quantized DNN Library for Exact Hardware Emulation
- URL: http://arxiv.org/abs/2106.08892v1
- Date: Tue, 15 Jun 2021 17:42:40 GMT
- Title: Development of Quantized DNN Library for Exact Hardware Emulation
- Authors: Masato Kiyama and Motoki Amagasaki and Masahiro Iida
- Abstract summary: Quantization is used to speed up execution time and save power when running deep neural networks (DNNs) on edge devices like AI chips.
We have developed PyParch, a library that executes quantized DNNs with exactly the same behavior as hardware.
- Score: 0.17188280334580192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization is used to speed up execution time and save power when running
deep neural networks (DNNs) on edge devices like AI chips. To investigate the effect of
quantization, we need to perform inference after quantizing the 32-bit floating-point
weights of a DNN to some bit width and then converting them back to 32-bit floating-point
precision, because the DNN library can only handle floating-point numbers. However, this
emulation does not reproduce the exact numerical behavior of the hardware. Exact behavior
is needed to detect overflow in MAC operations or to verify the operation on edge devices.
We have developed PyParch, a DNN library that executes quantized DNNs (QNNs) with exactly
the same behavior as hardware. In this paper, we describe a new proposal and implementation
of PyParch. As a result of the evaluation, the accuracy of QNNs with arbitrary bit widths
can be estimated for large and complex DNNs such as YOLOv5, and overflow can be detected.
We evaluated the overhead of the emulation time and found that it was 5.6 times slower for
QNN and 42 times slower for QNN with overflow detection compared to the normal DNN
execution time.
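
To make the gap described in the abstract concrete, here is a minimal sketch in NumPy of the two behaviors involved. It is not PyParch's actual API (which the abstract does not show): `fake_quantize` is a generic symmetric quantizer that rounds 32-bit floating-point weights to a chosen bit width and converts them back to float, which is all a float-only DNN library can emulate, while `int_mac_with_overflow` is a bit-accurate integer multiply-accumulate that flags overflow of a fixed-width accumulator, the hardware behavior an exact emulator has to reproduce.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int, scale: float) -> np.ndarray:
    """Quantize float32 weights to `bits`-bit signed integers, then convert
    them back to float32 (the only emulation a float-only DNN library can do)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), qmin, qmax)
    return (q * scale).astype(np.float32)

def int_mac_with_overflow(x_q, w_q, acc_bits: int = 16):
    """Bit-accurate multiply-accumulate on integer codes that flags overflow
    of an `acc_bits`-wide signed accumulator, something floating-point fake
    quantization cannot detect."""
    acc_min, acc_max = -(2 ** (acc_bits - 1)), 2 ** (acc_bits - 1) - 1
    acc, overflowed = 0, False
    for x, w in zip(x_q, w_q):
        acc += int(x) * int(w)            # exact integer arithmetic
        if not acc_min <= acc <= acc_max:
            overflowed = True             # a hardware accumulator would wrap or saturate here
    return acc, overflowed
```

Inference with fake-quantized weights reproduces the rounding error of quantization but still accumulates in floating point, so accumulator overflow on the real hardware goes unnoticed; that is the gap a hardware-exact library like PyParch is meant to close.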
Related papers
- Starting Positions Matter: A Study on Better Weight Initialization for Neural Network Quantization [71.44469196328507]
Quantization-specific model development techniques such as regularization, quantization-aware training, and quantization-robustness penalties have served to greatly boost the accuracy and robustness of modern DNNs.
We present an extensive study examining the effects of different weight initializations on a variety of CNN building blocks commonly used in efficient CNNs.
Next, we explore a new method for quantization-robust CNN initialization: using Graph Hypernetworks (GHN) to predict parameters of quantized DNNs.
arXiv Detail & Related papers (2025-06-12T08:11:34Z)
- A Converting Autoencoder Toward Low-latency and Energy-efficient DNN Inference at the Edge [4.11949030493552]
We present CBNet, a low-latency and energy-efficient deep neural network (DNN) inference framework tailored for edge devices.
It utilizes a "converting" autoencoder to efficiently transform hard images into easy ones.
CBNet achieves up to 4.8x speedup in inference latency and 79% reduction in energy usage compared to competing techniques.
arXiv Detail & Related papers (2024-03-11T08:13:42Z)
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade-off the inference accuracy and speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
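
DyBit's variable-length encoding is not spelled out in this summary; as a rough, hypothetical illustration of the mixed-precision idea it feeds into, the sketch below quantizes each layer's tensor with its own bit width using plain symmetric uniform quantization (the layer names and bit assignments are made up).

```python
import numpy as np

def quantize_symmetric(t: np.ndarray, bits: int):
    """Uniform symmetric quantization of one tensor to a chosen bit width.
    Illustrative only; DyBit itself uses an adaptive, variable-length encoding."""
    qmax = 2 ** (bits - 1) - 1
    m = float(np.max(np.abs(t)))
    scale = m / qmax if m > 0 else 1.0
    q = np.clip(np.round(t / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

# Hypothetical mixed-precision assignment: sensitive layers keep more bits.
layer_bits = {"conv_stem": 8, "conv_block": 4, "classifier": 6}
quantized = {name: quantize_symmetric(np.random.randn(64, 64), b)
             for name, b in layer_bits.items()}
```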
arXiv Detail & Related papers (2023-02-24T08:46:01Z)
- Attention-based Feature Compression for CNN Inference Offloading in Edge Computing [93.67044879636093]
This paper studies the computational offloading of CNN inference in device-edge co-inference systems.
We propose a novel autoencoder-based CNN architecture (AECNN) for effective feature extraction at end-device.
Experiments show that AECNN can compress the intermediate data by more than 256x with only about 4% accuracy loss.
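
As a hypothetical sketch of device-edge co-inference with compressed intermediate features (AECNN's actual attention-based encoder and split point are not given in this summary), the toy pipeline below runs the head of a network on the device, sends a heavily reduced feature through an encoder/decoder pair, and finishes inference at the edge; every shape and layer here is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the network pieces; all shapes are invented.
W_head = rng.standard_normal((512, 256))
W_enc  = rng.standard_normal((256, 8))     # aggressive feature reduction
W_dec  = rng.standard_normal((8, 256))
W_tail = rng.standard_normal((256, 10))

x = rng.standard_normal((1, 512))           # "image" features on the device
f = np.maximum(x @ W_head, 0)               # head runs on the end device
z = (f @ W_enc).astype(np.float16)          # compressed feature sent over the link
f_hat = z.astype(np.float32) @ W_dec        # decoder reconstructs at the edge server
logits = f_hat @ W_tail                     # tail finishes inference at the edge
```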
arXiv Detail & Related papers (2022-11-24T18:10:01Z)
- Automated machine learning for borehole resistivity measurements [0.0]
Deep neural networks (DNNs) offer a real-time solution for the inversion of borehole resistivity measurements.
It is possible to use extremely large DNNs to approximate the operators, but this demands considerable training time.
In this work, we propose a scoring function that accounts for the accuracy and size of the DNNs.
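
The scoring function itself is not given in this summary; a hypothetical form that trades validation error against model size, just to illustrate the idea, could look like the following (the weight `lam` and the candidate numbers are invented).

```python
def score(val_error: float, num_params: int, lam: float = 1e-8) -> float:
    """Hypothetical accuracy-vs-size score (lower is better); the paper's
    actual scoring function may differ."""
    return val_error + lam * num_params

# Invented candidates: (validation error, parameter count)
candidates = {"small_dnn": (0.12, 2_000_000), "large_dnn": (0.10, 80_000_000)}
best = min(candidates, key=lambda name: score(*candidates[name]))   # -> "small_dnn"
```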
arXiv Detail & Related papers (2022-07-20T12:27:22Z)
- PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++ [10.508187462682308]
Deep learning algorithms are implemented using floating-point real numbers.
This presents an obstacle to implementing them on low-end devices which may not have dedicated floating-point units (FPUs).
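
Sketched here in Python rather than the paper's pure C++, and with a generic shifted-and-clipped activation standing in for PocketNN's "pocket activations" (whose definition this summary does not give), the snippet below shows the basic point: a dense layer can be evaluated entirely in integer arithmetic, with a power-of-two division replacing the floating-point rescale.

```python
import numpy as np

def int_dense(x_q: np.ndarray, w_q: np.ndarray, shift: int = 8) -> np.ndarray:
    """Integer-only dense layer: integer MAC, then division by 2**shift instead
    of a floating-point rescale, then a clipped activation. Illustrative only."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)        # integer multiply-accumulate
    return np.clip(acc // (1 << shift), 0, 127).astype(np.int8)

x = np.random.randint(-128, 128, size=(1, 64), dtype=np.int8)
w = np.random.randint(-128, 128, size=(64, 32), dtype=np.int8)
y = int_dense(x, w)          # no floating-point unit is touched at any step
```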
arXiv Detail & Related papers (2022-01-08T16:52:34Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
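
A minimal sketch of the kind of decomposition this describes: an M-bit integer code (here restricted to the odd levels such an encoding represents) is rewritten as a power-of-two weighted sum of M tensors with entries in {-1, +1}, each of which can then run as a binary branch. The handling of scales and the exact level set follow the paper, not this snippet.

```python
import numpy as np

def decompose_pm1(q: np.ndarray, bits: int):
    """Decompose integer codes q (odd values in [-(2**bits - 1), 2**bits - 1])
    into `bits` tensors with entries in {-1, +1} such that
    q = sum_i 2**i * b[i]. Sketch of the idea only."""
    u = (q + (2 ** bits - 1)) // 2          # shift to the unsigned range [0, 2**bits - 1]
    branches = []
    for i in range(bits):
        c_i = (u >> i) & 1                  # i-th bit in {0, 1}
        branches.append(2 * c_i - 1)        # map to {-1, +1}
    return branches

q = np.array([-7, -1, 3, 7])                # 3-bit odd levels
b = decompose_pm1(q, bits=3)
assert np.array_equal(sum((2 ** i) * b[i] for i in range(3)), q)
```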
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, computational complexity of the ternary inner product can be reduced by a factor of 2.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- A Framework for Semi-Automatic Precision and Accuracy Analysis for Fast and Rigorous Deep Learning [1.5863809575305419]
Many papers experimentally observe that Deep Neural Networks (DNNs) can successfully run at almost ridiculously low precision.
This paper sheds some theoretical light upon why a DNN's FP accuracy stays high for low FP precision.
We present a software framework for FP error analysis for the inference phase of deep learning.
arXiv Detail & Related papers (2020-02-10T15:33:19Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
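
As an illustration of pattern-based pruning (the concrete pattern set PatDNN uses is not given in this summary, so the four patterns below are hypothetical), each 3x3 kernel is pruned with the predefined mask that preserves the most weight magnitude, leaving non-zeros in a layout the compiler can exploit.

```python
import numpy as np

# Hypothetical pattern library: each pattern keeps 4 of the 9 entries of a
# 3x3 kernel; the concrete set used in the paper may differ.
PATTERNS = [
    np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]]),
    np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]]),
    np.array([[0, 0, 0], [1, 1, 0], [1, 1, 0]]),
    np.array([[0, 0, 0], [0, 1, 1], [0, 1, 1]]),
]

def pattern_prune(kernel: np.ndarray) -> np.ndarray:
    """Keep the pattern that preserves the most weight magnitude in a 3x3 kernel."""
    best = max(PATTERNS, key=lambda m: np.sum(np.abs(kernel) * m))
    return kernel * best

k = np.random.randn(3, 3)
k_pruned = pattern_prune(k)     # exactly 4 non-zeros, in a compiler-known layout
```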
arXiv Detail & Related papers (2020-01-01T04:52:07Z)