Fast Implementation of 4-bit Convolutional Neural Networks for Mobile
Devices
- URL: http://arxiv.org/abs/2009.06488v2
- Date: Tue, 20 Oct 2020 15:23:20 GMT
- Title: Fast Implementation of 4-bit Convolutional Neural Networks for Mobile
Devices
- Authors: Anton Trusov, Elena Limonova, Dmitry Slugin, Dmitry Nikolaev, Vladimir
V. Arlazarov
- Abstract summary: We show an efficient implementation of 4-bit matrix multiplication for quantized neural networks.
We also demonstrate a 4-bit quantized neural network for optical character recognition (OCR) on the MIDV-500 dataset.
The results show that 4-bit quantization is well suited to mobile devices, providing sufficient accuracy and low inference time.
- Score: 0.8362190332905524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantized low-precision neural networks are very popular because they require
fewer computational resources for inference and can provide high performance,
which is vital for real-time and embedded recognition systems. However, their
advantages are apparent for FPGA and ASIC devices, while general-purpose
processor architectures are not always able to perform low-bit integer
computations efficiently. The most frequently used low-precision neural network
model for mobile central processors is an 8-bit quantized network. However, in
a number of cases, it is possible to use fewer bits for weights and
activations, and the only problem is the difficulty of efficient
implementation. We introduce an efficient implementation of 4-bit matrix
multiplication for quantized neural networks and perform time measurements on a
mobile ARM processor. It achieves a 2.9 times speedup over standard
floating-point multiplication and is 1.5 times faster than an 8-bit quantized implementation.
We also demonstrate a 4-bit quantized neural network for optical character
recognition (OCR) on the MIDV-500 dataset. The 4-bit quantized network gives
95.0% accuracy and a 48% overall inference speedup, while an 8-bit quantized
network gives 95.4% accuracy and a 39% speedup. These results show that 4-bit
quantization is well suited to mobile devices, providing sufficient accuracy at
low inference time.
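Below is a minimal NumPy sketch of the 4-bit quantization and integer matrix multiplication idea described in the abstract. It is an illustration only, not the paper's ARM implementation: the reported speedups come from an efficient ARM kernel (e.g., SIMD-friendly packing of 4-bit values), which this sketch does not attempt to reproduce, and the helper names (`quantize_uint4`, `matmul_uint4`), the per-tensor scales, and the zero point of 8 are assumptions chosen for clarity.

```python
import numpy as np

def quantize_uint4(x, scale, zero_point):
    """Quantize a float tensor to unsigned 4-bit values in [0, 15] (stored in uint8)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 15).astype(np.uint8)

def matmul_uint4(a_q, b_q, a_zp, b_zp):
    """Integer matmul of 4-bit quantized operands with a wide (int32) accumulator.

    Products of two 4-bit values fit in 8 bits, so long inner-dimension sums
    leave headroom for accumulation in wider registers; this is what makes
    packed low-bit kernels attractive on real hardware.
    """
    a = a_q.astype(np.int32) - a_zp
    b = b_q.astype(np.int32) - b_zp
    return a @ b

# Toy usage: quantize random activations and weights, multiply, and compare
# against the floating-point product.
rng = np.random.default_rng(0)
act = rng.standard_normal((2, 8)).astype(np.float32)
wgt = rng.standard_normal((8, 4)).astype(np.float32)

act_scale = np.ptp(act) / 15.0   # assumed per-tensor scale
wgt_scale = np.ptp(wgt) / 15.0
act_q = quantize_uint4(act, act_scale, zero_point=8)
wgt_q = quantize_uint4(wgt, wgt_scale, zero_point=8)

acc = matmul_uint4(act_q, wgt_q, 8, 8)          # int32 result
approx = acc * act_scale * wgt_scale            # dequantized approximation
print(np.max(np.abs(approx - act @ wgt)))       # small quantization error
```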
Related papers
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy against speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
arXiv Detail & Related papers (2023-02-24T08:46:01Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model.
We conduct large-scale experiments, training on 26,000 hours of de-identified production far-field and near-field audio data.
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming formulation.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks; a minimal sketch of this decomposition appears after this list.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Post-Training Sparsity-Aware Quantization [2.2530496464901106]
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency.
We propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities.
SPARQ achieves minor accuracy degradation and a 2x speedup over widely used hardware architectures, and it has a practical hardware implementation.
arXiv Detail & Related papers (2021-05-23T20:12:35Z)
- On the quantization of recurrent neural networks [9.549757800469196]
Quantization of neural networks can be defined as the approximation of the high-precision computation in the canonical neural network formulation.
We present an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies.
arXiv Detail & Related papers (2021-01-14T04:25:08Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables and use a differentiable method to search for them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We find that the accuracy decline is due to activation quantization and replace the conventional ReLU with a Bounded ReLU.
Our integer networks achieve performance equivalent to the corresponding floating-point networks, but have only 1/4 of the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
- Quantized Neural Network Inference with Precision Batching [4.519884877213097]
Precision Batching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations.
Across a variety of applications, Precision Batching yields end-to-end speedups of over 8x on a GPU within a 1% error margin of the full-precision baseline.
arXiv Detail & Related papers (2020-02-26T19:34:11Z)
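As a sketch of the {-1, +1} encoding decomposition mentioned in the related paper above: a k-bit quantized weight matrix with odd integer levels in [-(2^k - 1), 2^k - 1] can be written as W = B_0 + 2*B_1 + ... + 2^(k-1)*B_(k-1), where every B_i takes values in {-1, +1}, so W @ x reduces to k binary matrix products that XNOR/popcount kernels can accelerate. The helper below (`decompose_pm1`) and the 2-bit example are hypothetical illustrations of this general idea, not the cited paper's code.

```python
import numpy as np

def decompose_pm1(q, bits):
    """Split integer weights q (odd levels in [-(2**bits - 1), 2**bits - 1]) into
    `bits` matrices with entries in {-1, +1} such that q == sum(2**i * B[i]).
    Hypothetical helper for illustration only."""
    c = (q.astype(np.int64) + (1 << bits) - 1) // 2        # shift to c in [0, 2**bits - 1]
    return [2 * ((c >> i) & 1) - 1 for i in range(bits)]   # binary digits mapped to {-1, +1}

rng = np.random.default_rng(0)
W = rng.choice([-3, -1, 1, 3], size=(4, 6))   # 2-bit quantized weights (odd levels)
x = rng.integers(-5, 6, size=(6, 3))          # integer activations

branches = decompose_pm1(W, bits=2)
y = sum((1 << i) * (b @ x) for i, b in enumerate(branches))  # k binary matmuls, rescaled
assert np.array_equal(y, W @ x)               # exact reconstruction of the full product
```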