Fast Implementation of 4-bit Convolutional Neural Networks for Mobile
Devices
- URL: http://arxiv.org/abs/2009.06488v2
- Date: Tue, 20 Oct 2020 15:23:20 GMT
- Title: Fast Implementation of 4-bit Convolutional Neural Networks for Mobile
Devices
- Authors: Anton Trusov, Elena Limonova, Dmitry Slugin, Dmitry Nikolaev, Vladimir
V. Arlazarov
- Abstract summary: We show an efficient implementation of 4-bit matrix multiplication for quantized neural networks.
We also demonstrate a 4-bit quantized neural network for optical character recognition (OCR) on the MIDV-500 dataset.
The results show that 4-bit quantization is well suited to mobile devices, providing sufficient accuracy and low inference time.
- Score: 0.8362190332905524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantized low-precision neural networks are very popular because they require
fewer computational resources for inference and can provide high performance,
which is vital for real-time and embedded recognition systems. However, their
advantages are apparent for FPGA and ASIC devices, while general-purpose
processor architectures are not always able to perform low-bit integer
computations efficiently. The most frequently used low-precision neural network
model for mobile central processors is an 8-bit quantized network. However, in
a number of cases, it is possible to use fewer bits for weights and
activations, and the only problem is the difficulty of efficient
implementation. We introduce an efficient implementation of 4-bit matrix
multiplication for quantized neural networks and perform time measurements on a
mobile ARM processor. It achieves a 2.9 times speedup over standard
floating-point multiplication and is 1.5 times faster than an 8-bit quantized implementation.
We also demonstrate a 4-bit quantized neural network for optical character
recognition (OCR) on the MIDV-500 dataset. The 4-bit quantized network gives
95.0% accuracy and a 48% overall inference speedup, while an 8-bit quantized
network gives 95.4% accuracy and a 39% speedup. These results show that 4-bit
quantization is well suited to mobile devices, providing sufficient accuracy at
low inference time.
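Below is a minimal NumPy sketch of the 4-bit quantization and integer matrix multiplication idea described in the abstract. It is an illustration only, not the paper's ARM implementation: the reported speedups come from an efficient ARM kernel (e.g., SIMD-friendly packing of 4-bit values), which this sketch does not attempt to reproduce, and the helper names (`quantize_uint4`, `matmul_uint4`), the per-tensor scales, and the zero point of 8 are assumptions chosen for clarity.

```python
import numpy as np

def quantize_uint4(x, scale, zero_point):
    """Quantize a float tensor to unsigned 4-bit values in [0, 15] (stored in uint8)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 15).astype(np.uint8)

def matmul_uint4(a_q, b_q, a_zp, b_zp):
    """Integer matmul of 4-bit quantized operands with a wide (int32) accumulator.

    Products of two 4-bit values fit in 8 bits, so long inner-dimension sums
    leave headroom for accumulation in wider registers; this is what makes
    packed low-bit kernels attractive on real hardware.
    """
    a = a_q.astype(np.int32) - a_zp
    b = b_q.astype(np.int32) - b_zp
    return a @ b

# Toy usage: quantize random activations and weights, multiply, and compare
# against the floating-point product.
rng = np.random.default_rng(0)
act = rng.standard_normal((2, 8)).astype(np.float32)
wgt = rng.standard_normal((8, 4)).astype(np.float32)

act_scale = np.ptp(act) / 15.0   # assumed per-tensor scale
wgt_scale = np.ptp(wgt) / 15.0
act_q = quantize_uint4(act, act_scale, zero_point=8)
wgt_q = quantize_uint4(wgt, wgt_scale, zero_point=8)

acc = matmul_uint4(act_q, wgt_q, 8, 8)          # int32 result
approx = acc * act_scale * wgt_scale            # dequantized approximation
print(np.max(np.abs(approx - act @ wgt)))       # small quantization error
```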
Related papers
- DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
arXiv Detail & Related papers (2023-04-18T15:13:10Z)
- DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference [28.912023025671868]
This work targets an adaptive data representation with variable-length encoding called DyBit.
We also propose a hardware-aware quantization framework with a mixed-precision accelerator to trade off inference accuracy against speedup.
Experimental results demonstrate that the inference accuracy via DyBit is 1.997% higher than the state-of-the-art at 4-bit quantization.
arXiv Detail & Related papers (2023-02-24T08:46:01Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model.
We conduct large-scale experiments, training on 26,000 hours of de-identified production far-field and near-field audio data.
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- OMPQ: Orthogonal Mixed Precision Quantization [64.59700856607017]
Mixed precision quantization takes advantage of hardware's multiple bit-width arithmetic operations to unleash the full potential of network quantization.
We propose to optimize a proxy metric, network orthogonality, which is highly correlated with the loss of the integer programming formulation.
This approach reduces the search time and required data amount by orders of magnitude, with little compromise on quantization accuracy.
arXiv Detail & Related papers (2021-09-16T10:59:33Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks; a minimal sketch of this decomposition appears after this list.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- Post-Training Sparsity-Aware Quantization [2.2530496464901106]
Quantization is a technique used in deep neural networks (DNNs) to increase execution performance and hardware efficiency.
We propose a sparsity-aware quantization (SPARQ) method, in which the unstructured and dynamic activation sparsity is leveraged in different representation granularities.
SPARQ achieves minor accuracy degradation and a 2x speedup over widely used hardware architectures, and it has a practical hardware implementation.
arXiv Detail & Related papers (2021-05-23T20:12:35Z)
- On the quantization of recurrent neural networks [9.549757800469196]
Quantization of neural networks can be defined as the approximation of the high-precision computation in the canonical neural network formulation.
We present an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies.
arXiv Detail & Related papers (2021-01-14T04:25:08Z)
- HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
- Searching for Low-Bit Weights in Quantized Neural Networks [129.8319019563356]
Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators.
We propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables and use a differentiable method to search for them accurately.
arXiv Detail & Related papers (2020-09-18T09:13:26Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We find that the accuracy decline is due to activation quantization and replace the conventional ReLU with a Bounded ReLU.
Our integer networks achieve performance equivalent to the corresponding floating-point networks, but have only 1/4 of the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
- Quantized Neural Network Inference with Precision Batching [4.519884877213097]
Precision Batching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations.
Across a variety of applications, Precision Batching yields end-to-end speedups of over 8x on a GPU within a 1% error margin of the full-precision baseline.
arXiv Detail & Related papers (2020-02-26T19:34:11Z)
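As a sketch of the {-1, +1} encoding decomposition mentioned in the related paper above: a k-bit quantized weight matrix with odd integer levels in [-(2^k - 1), 2^k - 1] can be written as W = B_0 + 2*B_1 + ... + 2^(k-1)*B_(k-1), where every B_i takes values in {-1, +1}, so W @ x reduces to k binary matrix products that XNOR/popcount kernels can accelerate. The helper below (`decompose_pm1`) and the 2-bit example are hypothetical illustrations of this general idea, not the cited paper's code.

```python
import numpy as np

def decompose_pm1(q, bits):
    """Split integer weights q (odd levels in [-(2**bits - 1), 2**bits - 1]) into
    `bits` matrices with entries in {-1, +1} such that q == sum(2**i * B[i]).
    Hypothetical helper for illustration only."""
    c = (q.astype(np.int64) + (1 << bits) - 1) // 2        # shift to c in [0, 2**bits - 1]
    return [2 * ((c >> i) & 1) - 1 for i in range(bits)]   # binary digits mapped to {-1, +1}

rng = np.random.default_rng(0)
W = rng.choice([-3, -1, 1, 3], size=(4, 6))   # 2-bit quantized weights (odd levels)
x = rng.integers(-5, 6, size=(6, 3))          # integer activations

branches = decompose_pm1(W, bits=2)
y = sum((1 << i) * (b @ x) for i, b in enumerate(branches))  # k binary matmuls, rescaled
assert np.array_equal(y, W @ x)               # exact reconstruction of the full product
```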