Development of Quantized DNN Library for Exact Hardware Emulation
- URL: http://arxiv.org/abs/2106.08892v1
- Date: Tue, 15 Jun 2021 17:42:40 GMT
- Title: Development of Quantized DNN Library for Exact Hardware Emulation
- Authors: Masato Kiyama and Motoki Amagasaki and Masahiro Iida
- Abstract summary: Quantization is used to speed up execution time and save power when running deep neural networks (DNNs) on edge devices like AI chips.
We have developed PyParch, a library that executes quantized DNNs with exactly the same behavior as hardware.
- Score: 0.17188280334580192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization is used to speed up execution time and save power when running
deep neural networks (DNNs) on edge devices like AI chips. To investigate the effect of
quantization, we need to perform inference after quantizing the 32-bit floating-point
weights of a DNN to some bit width and then converting them back to 32-bit floating-point
precision, because the DNN library can only handle floating-point numbers. However, this
emulation does not reproduce the exact numerical behavior of the hardware. Exact behavior
is needed to detect overflow in MAC operations or to verify the operation on edge devices.
We have developed PyParch, a DNN library that executes quantized DNNs (QNNs) with exactly
the same behavior as hardware. In this paper, we describe a new proposal and implementation
of PyParch. As a result of the evaluation, the accuracy of QNNs with arbitrary bit widths
can be estimated for large and complex DNNs such as YOLOv5, and overflow can be detected.
We evaluated the overhead of the emulation time and found that it was 5.6 times slower for
QNN and 42 times slower for QNN with overflow detection compared to the normal DNN
execution time.
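
To make the gap described in the abstract concrete, here is a minimal sketch in NumPy of the two behaviors involved. It is not PyParch's actual API (which the abstract does not show): `fake_quantize` is a generic symmetric quantizer that rounds 32-bit floating-point weights to a chosen bit width and converts them back to float, which is all a float-only DNN library can emulate, while `int_mac_with_overflow` is a bit-accurate integer multiply-accumulate that flags overflow of a fixed-width accumulator, the hardware behavior an exact emulator has to reproduce.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int, scale: float) -> np.ndarray:
    """Quantize float32 weights to `bits`-bit signed integers, then convert
    them back to float32 (the only emulation a float-only DNN library can do)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), qmin, qmax)
    return (q * scale).astype(np.float32)

def int_mac_with_overflow(x_q, w_q, acc_bits: int = 16):
    """Bit-accurate multiply-accumulate on integer codes that flags overflow
    of an `acc_bits`-wide signed accumulator, something floating-point fake
    quantization cannot detect."""
    acc_min, acc_max = -(2 ** (acc_bits - 1)), 2 ** (acc_bits - 1) - 1
    acc, overflowed = 0, False
    for x, w in zip(x_q, w_q):
        acc += int(x) * int(w)            # exact integer arithmetic
        if not acc_min <= acc <= acc_max:
            overflowed = True             # a hardware accumulator would wrap or saturate here
    return acc, overflowed
```

Inference with fake-quantized weights reproduces the rounding error of quantization but still accumulates in floating point, so accumulator overflow on the real hardware goes unnoticed; that is the gap a hardware-exact library like PyParch is meant to close.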
Related papers
- Starting Positions Matter: A Study on Better Weight Initialization for Neural Network Quantization [71.44469196328507]
Quantization-specific model development techniques such as regularization, quantization-aware training, and quantization-robustness penalties have served to greatly boost the accuracy and robustness of modern DNNs.
We present an extensive study examining the effects of different weight initializations on a variety of CNN building blocks commonly used in efficient CNNs.
Next, we explore a new method for quantization-robust CNN initialization: using Graph Hypernetworks (GHN) to predict parameters of quantized DNNs.
arXiv Detail & Related papers (2025-06-12T08:11:34Z)
- A Converting Autoencoder Toward Low-latency and Energy-efficient DNN Inference at the Edge [4.11949030493552]
We present CBNet, a low-latency and energy-efficient deep neural network (DNN) inference framework tailored for edge devices.
It utilizes a "converting" autoencoder to efficiently transform hard images into easy ones.
CBNet achieves up to 4.8x speedup in inference latency and 79% reduction in energy usage compared to competing techniques.
arXiv Detail & Related papers (2024-03-11T08:13:42Z)
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade-off the inference accuracy and speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
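
DyBit's variable-length encoding is not spelled out in this summary; as a rough, hypothetical illustration of the mixed-precision idea it feeds into, the sketch below quantizes each layer's tensor with its own bit width using plain symmetric uniform quantization (the layer names and bit assignments are made up).

```python
import numpy as np

def quantize_symmetric(t: np.ndarray, bits: int):
    """Uniform symmetric quantization of one tensor to a chosen bit width.
    Illustrative only; DyBit itself uses an adaptive, variable-length encoding."""
    qmax = 2 ** (bits - 1) - 1
    m = float(np.max(np.abs(t)))
    scale = m / qmax if m > 0 else 1.0
    q = np.clip(np.round(t / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

# Hypothetical mixed-precision assignment: sensitive layers keep more bits.
layer_bits = {"conv_stem": 8, "conv_block": 4, "classifier": 6}
quantized = {name: quantize_symmetric(np.random.randn(64, 64), b)
             for name, b in layer_bits.items()}
```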
arXiv Detail & Related papers (2023-02-24T08:46:01Z)
- Attention-based Feature Compression for CNN Inference Offloading in Edge Computing [93.67044879636093]
This paper studies the computational offloading of CNN inference in device-edge co-inference systems.
We propose a novel autoencoder-based CNN architecture (AECNN) for effective feature extraction at end-device.
Experiments show that AECNN can compress the intermediate data by more than 256x with only about 4% accuracy loss.
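
As a hypothetical sketch of device-edge co-inference with compressed intermediate features (AECNN's actual attention-based encoder and split point are not given in this summary), the toy pipeline below runs the head of a network on the device, sends a heavily reduced feature through an encoder/decoder pair, and finishes inference at the edge; every shape and layer here is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the network pieces; all shapes are invented.
W_head = rng.standard_normal((512, 256))
W_enc  = rng.standard_normal((256, 8))     # aggressive feature reduction
W_dec  = rng.standard_normal((8, 256))
W_tail = rng.standard_normal((256, 10))

x = rng.standard_normal((1, 512))           # "image" features on the device
f = np.maximum(x @ W_head, 0)               # head runs on the end device
z = (f @ W_enc).astype(np.float16)          # compressed feature sent over the link
f_hat = z.astype(np.float32) @ W_dec        # decoder reconstructs at the edge server
logits = f_hat @ W_tail                     # tail finishes inference at the edge
```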
arXiv Detail & Related papers (2022-11-24T18:10:01Z)
- Automated machine learning for borehole resistivity measurements [0.0]
Deep neural networks (DNNs) offer a real-time solution for the inversion of borehole resistivity measurements.
It is possible to use extremely large DNNs to approximate the operators, but this demands considerable training time.
In this work, we propose a scoring function that accounts for the accuracy and size of the DNNs.
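
The scoring function itself is not given in this summary; a hypothetical form that trades validation error against model size, just to illustrate the idea, could look like the following (the weight `lam` and the candidate numbers are invented).

```python
def score(val_error: float, num_params: int, lam: float = 1e-8) -> float:
    """Hypothetical accuracy-vs-size score (lower is better); the paper's
    actual scoring function may differ."""
    return val_error + lam * num_params

# Invented candidates: (validation error, parameter count)
candidates = {"small_dnn": (0.12, 2_000_000), "large_dnn": (0.10, 80_000_000)}
best = min(candidates, key=lambda name: score(*candidates[name]))   # -> "small_dnn"
```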
arXiv Detail & Related papers (2022-07-20T12:27:22Z)
- PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++ [10.508187462682308]
Deep learning algorithms are implemented using floating-point real numbers.
This presents an obstacle to implementing them on low-end devices which may not have dedicated floating-point units (FPUs).
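
Sketched here in Python rather than the paper's pure C++, and with a generic shifted-and-clipped activation standing in for PocketNN's "pocket activations" (whose definition this summary does not give), the snippet below shows the basic point: a dense layer can be evaluated entirely in integer arithmetic, with a power-of-two division replacing the floating-point rescale.

```python
import numpy as np

def int_dense(x_q: np.ndarray, w_q: np.ndarray, shift: int = 8) -> np.ndarray:
    """Integer-only dense layer: integer MAC, then division by 2**shift instead
    of a floating-point rescale, then a clipped activation. Illustrative only."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)        # integer multiply-accumulate
    return np.clip(acc // (1 << shift), 0, 127).astype(np.int8)

x = np.random.randint(-128, 128, size=(1, 64), dtype=np.int8)
w = np.random.randint(-128, 128, size=(64, 32), dtype=np.int8)
y = int_dense(x, w)          # no floating-point unit is touched at any step
```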
arXiv Detail & Related papers (2022-01-08T16:52:34Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, the concept of network orthogonality, which is highly correlated with the loss of the integer programming.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
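
A minimal sketch of the kind of decomposition this describes: an M-bit integer code (here restricted to the odd levels such an encoding represents) is rewritten as a power-of-two weighted sum of M tensors with entries in {-1, +1}, each of which can then run as a binary branch. The handling of scales and the exact level set follow the paper, not this snippet.

```python
import numpy as np

def decompose_pm1(q: np.ndarray, bits: int):
    """Decompose integer codes q (odd values in [-(2**bits - 1), 2**bits - 1])
    into `bits` tensors with entries in {-1, +1} such that
    q = sum_i 2**i * b[i]. Sketch of the idea only."""
    u = (q + (2 ** bits - 1)) // 2          # shift to the unsigned range [0, 2**bits - 1]
    branches = []
    for i in range(bits):
        c_i = (u >> i) & 1                  # i-th bit in {0, 1}
        branches.append(2 * c_i - 1)        # map to {-1, +1}
    return branches

q = np.array([-7, -1, 3, 7])                # 3-bit odd levels
b = decompose_pm1(q, bits=3)
assert np.array_equal(sum((2 ** i) * b[i] for i in range(3)), q)
```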
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, computational complexity of the ternary inner product can be reduced by a factor of 2.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z)
- AQD: Towards Accurate Fully-Quantized Object Detection [94.06347866374927]
We propose an Accurate Quantized object Detection solution, termed AQD, to get rid of floating-point computation.
Our AQD achieves comparable or even better performance compared with the full-precision counterpart under extremely low-bit schemes.
arXiv Detail & Related papers (2020-07-14T09:07:29Z)
- A Framework for Semi-Automatic Precision and Accuracy Analysis for Fast and Rigorous Deep Learning [1.5863809575305419]
Many papers experimentally observe that Deep Neural Networks (DNNs) can successfully run at almost ridiculously low precision.
This paper sheds some theoretical light upon why a DNN's FP accuracy stays high for low FP precision.
We present a software framework for FP error analysis for the inference phase of deep learning.
arXiv Detail & Related papers (2020-02-10T15:33:19Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
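
As an illustration of pattern-based pruning (the concrete pattern set PatDNN uses is not given in this summary, so the four patterns below are hypothetical), each 3x3 kernel is pruned with the predefined mask that preserves the most weight magnitude, leaving non-zeros in a layout the compiler can exploit.

```python
import numpy as np

# Hypothetical pattern library: each pattern keeps 4 of the 9 entries of a
# 3x3 kernel; the concrete set used in the paper may differ.
PATTERNS = [
    np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]]),
    np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]]),
    np.array([[0, 0, 0], [1, 1, 0], [1, 1, 0]]),
    np.array([[0, 0, 0], [0, 1, 1], [0, 1, 1]]),
]

def pattern_prune(kernel: np.ndarray) -> np.ndarray:
    """Keep the pattern that preserves the most weight magnitude in a 3x3 kernel."""
    best = max(PATTERNS, key=lambda m: np.sum(np.abs(kernel) * m))
    return kernel * best

k = np.random.randn(3, 3)
k_pruned = pattern_prune(k)     # exactly 4 non-zeros, in a compiler-known layout
```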
arXiv Detail & Related papers (2020-01-01T04:52:07Z)