Fast matrix multiplication for binary and ternary CNNs on ARM CPU
- URL: http://arxiv.org/abs/2205.09120v1
- Date: Wed, 18 May 2022 14:52:34 GMT
- Title: Fast matrix multiplication for binary and ternary CNNs on ARM CPU
- Authors: Anton Trusov, Elena Limonova, Dmitry Nikolaev, Vladimir V. Arlazarov
- Abstract summary: We propose fast algorithms of ternary, ternary-binary, and binary matrix multiplication for mobile devices with ARM architecture.
Our algorithms can be used to implement inference of convolutional and fully connected layers of TNNs, TBNs, and BNNs.
We evaluate them experimentally on an ARM Cortex-A73 CPU and compare their inference speed to efficient implementations of full-precision, 8-bit, and 4-bit quantized matrix multiplications.
- Score: 0.9135092203041721
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-bit quantized neural networks are of great interest in practical
applications because they significantly reduce the consumption of both memory
and computational resources. Binary neural networks (BNNs) are memory and
computationally efficient as they require only one bit per weight and
activation and can be computed using Boolean logic and bit count operations.
Quantized neural networks with ternary weights and activations (TNNs) and with
binary weights and ternary activations (TBNs) aim to improve recognition
quality compared to BNNs while preserving a low bit-width. However, their
efficient implementation is usually
considered on ASICs and FPGAs, limiting their applicability in real-life tasks.
At the same time, one of the areas where efficient recognition is most in
demand is recognition on mobile devices using their CPUs. However, there are no
known fast implementations of TBNs and TNNs, only the daBNN library for BNN
inference. In this paper, we propose novel fast algorithms of ternary,
ternary-binary, and binary matrix multiplication for mobile devices with ARM
architecture. In our algorithms, ternary weights are represented using a 2-bit
encoding and binary weights using a single bit. This allows us to replace
matrix multiplication with Boolean logic operations that can be computed on
128 bits simultaneously using the ARM NEON SIMD extension. The matrix multiplication
results are accumulated in 16-bit integer registers. We also use a special
reordering of values in the left and right matrices. Together, these techniques
allow us to compute the matrix product efficiently while minimizing the number
of loads and stores compared to the algorithm from daBNN. Our algorithms can be
used to
implement inference of convolutional and fully connected layers of TNNs, TBNs,
and BNNs. We evaluate them experimentally on an ARM Cortex-A73 CPU and compare
their inference speed to efficient implementations of full-precision, 8-bit,
and 4-bit quantized matrix multiplications.
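To make the bit-level arithmetic concrete, the sketch below shows how binary, ternary-binary, and ternary dot products reduce to Boolean logic and population counts. It is a minimal, portable illustration over 64-bit words using the GCC/Clang builtin __builtin_popcountll, not the paper's NEON kernel (which operates on 128-bit registers, accumulates in 16-bit lanes, and uses a reordered data layout); the encoding assumed here (a nonzero mask plus a sign bit for a ternary value, a sign bit alone for a binary value) is one common convention and an assumption of this sketch.
```c
/* Portable sketch of the bitwise dot products behind BNN/TBN/TNN layers.
 * 64-bit words and a compiler popcount builtin are used purely to
 * illustrate the arithmetic; this is not the paper's NEON kernel.
 *
 * Encoding (one bit position per element):
 *   binary  value b in {-1,+1}:   sign bit (1 means -1)
 *   ternary value t in {-1,0,+1}: nonzero-mask bit + sign bit
 */
#include <stdint.h>

/* 64 binary*binary products: dot = 64 - 2 * (number of sign disagreements) */
static inline int dot_bb(uint64_t sign_a, uint64_t sign_b) {
    return 64 - 2 * __builtin_popcountll(sign_a ^ sign_b);
}

/* 64 ternary*binary products: only positions where the ternary operand is
 * nonzero contribute; each contributes +1 on matching signs, -1 otherwise. */
static inline int dot_tb(uint64_t nz_a, uint64_t sign_a, uint64_t sign_b) {
    uint64_t disagree = nz_a & (sign_a ^ sign_b);
    return __builtin_popcountll(nz_a) - 2 * __builtin_popcountll(disagree);
}

/* 64 ternary*ternary products: both operands must be nonzero to contribute. */
static inline int dot_tt(uint64_t nz_a, uint64_t sign_a,
                         uint64_t nz_b, uint64_t sign_b) {
    uint64_t both     = nz_a & nz_b;
    uint64_t disagree = both & (sign_a ^ sign_b);
    return __builtin_popcountll(both) - 2 * __builtin_popcountll(disagree);
}
```
On ARM, the same logic maps onto NEON bitwise and per-byte population-count instructions (e.g. vandq_u8, veorq_u8, vcntq_u8), with the per-byte counts widened and accumulated into 16-bit vector lanes.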
Related papers
- A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network [5.144744286453014]
A&B BNN is proposed to remove part of the multiplication operations in a traditional BNN and replace the rest with an equal number of bit operations.
The mask layer can be removed during inference by leveraging the intrinsic characteristics of BNN.
The quantized RPReLU structure enables more efficient bit operations by constraining its slope to be integer powers of 2.
arXiv Detail & Related papers (2024-03-06T14:28:49Z)
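The power-of-two slope constraint mentioned in the A&B BNN summary above is easy to illustrate in isolation: multiplying a fixed-point value by 2^(-k) degenerates into an arithmetic shift. The sketch below is a simplified leaky-ReLU-style stand-in, not the paper's quantized RPReLU, and it assumes arithmetic right shifts of signed values (the behavior of GCC/Clang on ARM).
```c
/* Illustration of why constraining an activation slope to an integer power
 * of two removes the multiplication: scaling by 2^(-k) becomes a shift.
 * Simplified stand-in, not the paper's exact quantized RPReLU. */
#include <stdint.h>

static inline int32_t leaky_relu_pow2(int32_t x, unsigned k) {
    /* negative-side slope is 2^(-k); the shift (assumed arithmetic for
     * negative values) replaces the multiplication x * 2^(-k) */
    return x >= 0 ? x : (x >> k);
}
```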
- Compacting Binary Neural Networks by Sparse Kernel Selection [58.84313343190488]
This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed.
We develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords.
Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
arXiv Detail & Related papers (2023-03-25T13:53:02Z)
- Exploiting Kernel Compression on BNNs [0.0]
In this work, we observe that the number of unique sequences representing a set of weights is typically low.
We propose a clustering scheme to identify the most common sequences of bits and replace the less common ones with some similar common sequences.
Our experimental results show that our technique can reduce memory requirement by 1.32x and improve performance by 1.35x.
arXiv Detail & Related papers (2022-12-01T16:05:10Z)
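The bit-sequence clustering described in the entry above can be sketched with plain integer patterns. The code assumes binarized 3x3 kernels stored as 9-bit codes and a simple frequency threshold, both illustrative assumptions rather than the paper's actual representation; rare patterns are remapped to the closest frequent one by Hamming distance.
```c
/* Hedged sketch: count how often each binary 3x3 kernel pattern occurs and
 * remap rare patterns to the nearest (Hamming distance) frequent one. */
#include <stdint.h>
#include <stddef.h>

#define NPAT 512  /* 2^9 possible binary 3x3 kernels (assumed layout) */

void compress_kernels(uint16_t *kernels, size_t n, size_t min_count) {
    size_t count[NPAT] = {0};
    for (size_t i = 0; i < n; ++i)
        count[kernels[i] & (NPAT - 1)]++;

    for (size_t i = 0; i < n; ++i) {
        uint16_t p = kernels[i] & (NPAT - 1);
        if (count[p] >= min_count) continue;      /* already a common pattern */
        int best = -1, best_dist = 10;            /* > max distance of 9 bits */
        for (int q = 0; q < NPAT; ++q) {
            if (count[q] < min_count) continue;   /* only map onto common ones */
            int dist = __builtin_popcount(p ^ (unsigned)q);
            if (dist < best_dist ||
                (dist == best_dist && best >= 0 && count[q] > count[best])) {
                best = q;
                best_dist = dist;
            }
        }
        if (best >= 0) kernels[i] = (uint16_t)best;
    }
}
```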
- Sub-bit Neural Networks: Learning to Compress and Accelerate Binary Neural Networks [72.81092567651395]
Sub-bit Neural Networks (SNNs) are a new type of binary quantization design tailored to compress and accelerate BNNs.
SNNs are trained with a kernel-aware optimization framework, which exploits binary quantization in the fine-grained convolutional kernel space.
Experiments on visual recognition benchmarks and the hardware deployment on FPGA validate the great potentials of SNNs.
arXiv Detail & Related papers (2021-10-18T11:30:29Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
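One common way to realize such a multi-branch decomposition, shown here as a hedged sketch rather than the paper's exact scheme, is to rewrite an M-bit unsigned code x = sum_i 2^i * t_i with t_i in {0,1} using b_i = 2*t_i - 1 in {-1,+1}, which turns a dot product with x into M binary dot products plus a constant term; in a real multi-branch binary network each branch would itself be evaluated with XOR/popcount kernels.
```c
/* Hedged sketch of the multi-branch idea: x = sum_i 2^(i-1)*b_i + (2^M-1)/2
 * with b_i in {-1,+1}, so dot(x, a) splits into M binary branches. */
#include <stdint.h>
#include <stddef.h>

/* dot(x, a) where x holds M-bit unsigned codes (M <= 8 here). */
int32_t dot_via_binary_branches(const uint8_t *x, const int8_t *a,
                                size_t n, int M) {
    int32_t acc2 = 0;                     /* accumulates 2 * result */
    int32_t sum_a = 0;
    for (size_t j = 0; j < n; ++j) sum_a += a[j];

    for (int i = 0; i < M; ++i) {
        int32_t branch = 0;               /* <b_i, a> with b_i in {-1,+1} */
        for (size_t j = 0; j < n; ++j)
            branch += ((x[j] >> i) & 1) ? a[j] : -a[j];
        acc2 += (1 << i) * branch;        /* 2 * 2^(i-1) * branch */
    }
    acc2 += ((1 << M) - 1) * sum_a;       /* 2 * (2^M - 1)/2 * sum(a) */
    return acc2 / 2;                      /* acc2 is always even */
}
```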
- Binary Graph Neural Networks [69.51765073772226]
Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data.
In this paper, we present and evaluate different strategies for the binarization of graph neural networks.
We show that through careful design of the models, and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks.
arXiv Detail & Related papers (2020-12-31T18:48:58Z)
- Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices [0.8362190332905524]
We show an efficient implementation of 4-bit matrix multiplication for quantized neural networks.
We also demonstrate a 4-bit quantized neural network for OCR recognition on the MIDV-500 dataset.
The results show that 4-bit quantization perfectly suits mobile devices, yielding good enough accuracy and low inference time.
arXiv Detail & Related papers (2020-09-14T14:48:40Z)
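For comparison with the 1- and 2-bit kernels above, a 4-bit (nibble-packed) dot product can be sketched as below. The packing convention (two unsigned 4-bit values per byte, low nibble first) is an assumption of this sketch; an efficient mobile kernel would unpack and multiply in SIMD registers with 16-bit intermediate accumulators instead of scalar code.
```c
/* Hedged sketch of a nibble-packed 4-bit dot product. */
#include <stdint.h>
#include <stddef.h>

int32_t dot_u4(const uint8_t *a_packed, const uint8_t *b_packed,
               size_t n_bytes) {
    int32_t acc = 0;
    for (size_t i = 0; i < n_bytes; ++i) {
        uint8_t a = a_packed[i], b = b_packed[i];
        acc += (int32_t)(a & 0x0F) * (b & 0x0F);  /* low nibbles  */
        acc += (int32_t)(a >> 4)   * (b >> 4);    /* high nibbles */
    }
    return acc;
}
```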
- FATNN: Fast and Accurate Ternary Neural Networks [89.07796377047619]
Ternary Neural Networks (TNNs) have received much attention due to being potentially orders of magnitude faster in inference, as well as more power efficient, than full-precision counterparts.
In this work, we show that, under some mild constraints, the computational complexity of the ternary inner product can be reduced by a factor of 2.
We elaborately design an implementation-dependent ternary quantization algorithm to mitigate the performance gap.
arXiv Detail & Related papers (2020-08-12T04:26:18Z)
- Efficient Integer-Arithmetic-Only Convolutional Neural Networks [87.01739569518513]
We replace the conventional ReLU with a Bounded ReLU and find that the accuracy decline is due to activation quantization.
Our integer networks achieve performance equivalent to the corresponding full-precision networks, but have only 1/4 of the memory cost and run 2x faster on modern GPUs.
arXiv Detail & Related papers (2020-06-21T08:23:03Z)
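The bounded activation mentioned in the entry above is simple to state: clipping to a fixed range gives the activation a known dynamic range, so its quantization scale can be fixed ahead of time. A minimal sketch follows; the bound is a hyperparameter of the network, not a value taken from the paper.
```c
/* Minimal sketch of a bounded ReLU: clip the activation to [0, bound] so a
 * fixed quantization scale covers its entire range. */
static inline float bounded_relu(float x, float bound) {
    if (x < 0.0f) return 0.0f;
    return x < bound ? x : bound;
}
```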
- BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs [7.635154697466773]
The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy.
We propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs.
arXiv Detail & Related papers (2020-05-20T08:15:33Z)
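The lookup-table idea behind BiQGEMM-style kernels can be sketched as follows: activations are split into small groups, the dot product of each group with every possible {-1,+1} pattern is precomputed once, and each binary-coded weight row then indexes those tables instead of performing multiply-adds. The group size and data layout below are illustrative assumptions, not the paper's exact design.
```c
/* Hedged sketch of lookup-table based multiplication for binary-coded weights. */
#include <stdint.h>
#include <stddef.h>

#define MU 4                      /* group size: 2^MU = 16 table entries */

/* table[g][p] = dot(act[g*MU .. g*MU+MU-1], pattern p), where bit i of p
 * selects +1 (bit set) or -1 for element i of the group. */
void build_tables(const float *act, size_t n_groups, float table[][1 << MU]) {
    for (size_t g = 0; g < n_groups; ++g)
        for (int p = 0; p < (1 << MU); ++p) {
            float s = 0.0f;
            for (int i = 0; i < MU; ++i)
                s += ((p >> i) & 1) ? act[g * MU + i] : -act[g * MU + i];
            table[g][p] = s;
        }
}

/* One output value: weight_bits[g] holds the MU-bit {-1,+1} pattern of the
 * weight row for group g; each group costs one lookup and one add. */
float dot_via_lut(const uint8_t *weight_bits, size_t n_groups,
                  float table[][1 << MU]) {
    float acc = 0.0f;
    for (size_t g = 0; g < n_groups; ++g)
        acc += table[g][weight_bits[g] & ((1 << MU) - 1)];
    return acc;
}
```
Since each table is reused across all output rows, the precomputation cost amortizes when the weight matrix has many rows, which is where the lookup approach pays off.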