FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic
Arrays
- URL: http://arxiv.org/abs/2105.13434v1
- Date: Thu, 27 May 2021 20:19:39 GMT
- Title: FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic
Arrays
- Authors: Surya Selvam, Vinod Ganesan and Pratyush Kumar
- Abstract summary: We propose FuSeConv as a drop-in replacement for depth-wise separable convolution.
FuSeConv generalizes the decomposition of convolutions fully to separable 1D convolutions along spatial and depth dimensions.
We achieve a significant speed-up of 3x-7x with the MobileNet family of networks on a systolic array of size 64x64, with comparable accuracy on the ImageNet dataset.
- Score: 2.8583189395674653
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Both efficient neural networks and hardware accelerators are being explored
to speed up DNN inference on edge devices. For example, MobileNet uses
depthwise separable convolution to achieve much lower latency, while systolic
arrays provide much higher performance per watt. Interestingly, however, the
combination of these two ideas is inefficient: The computational patterns of
depth-wise separable convolution are not systolic and lack data reuse to
saturate the systolic array's constrained dataflow. In this paper, we propose
FuSeConv (Fully-Separable Convolution) as a drop-in replacement for depth-wise
separable convolution. FuSeConv generalizes the decomposition of convolutions
fully to separable 1D convolutions along spatial and depth dimensions. The
resultant computation is systolic and efficiently utilizes the systolic array
with a slightly modified dataflow. With FuSeConv, we achieve a significant
speed-up of 3x-7x with the MobileNet family of networks on a systolic array of
size 64x64, with comparable accuracy on the ImageNet dataset. The high speed-up
motivates exploration of hardware-aware Neural Operator Search (NOS) in
complement to ongoing efforts on Neural Architecture Search (NAS).
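To make the decomposition described in the abstract concrete, here is a minimal sketch that counts multiply-accumulate operations (MACs) for a standard convolution, a depthwise separable convolution, and two fully separable layouts. The abstract only says that FuSeConv uses 1D convolutions along the spatial and depth dimensions; the "half" and "full" layouts below, and all function names, are illustrative assumptions rather than details taken from this page. Stride 1 and "same" padding are also assumed.

```python
"""Illustrative MAC counts; layer layouts for the 1D variants are assumptions."""


def standard_conv_macs(h, w, k, c_in, c_out):
    # One KxK filter spanning all input channels, per output channel.
    return h * w * k * k * c_in * c_out


def depthwise_separable_macs(h, w, k, c_in, c_out):
    depthwise = h * w * k * k * c_in      # one KxK filter per input channel
    pointwise = h * w * c_in * c_out      # 1x1 convolution mixing channels
    return depthwise + pointwise


def fuse_half_macs(h, w, k, c_in, c_out):
    # ASSUMPTION: half of the channels get a 1xK row filter and the other
    # half a Kx1 column filter, so the 1D stage still outputs c_in channels.
    if c_in % 2 != 0:
        raise ValueError("c_in must be even for the assumed half layout")
    spatial_1d = h * w * k * c_in         # 2 * (c_in / 2) * k per output pixel
    pointwise = h * w * c_in * c_out
    return spatial_1d + pointwise


def fuse_full_macs(h, w, k, c_in, c_out):
    # ASSUMPTION: every channel gets both a 1xK and a Kx1 filter; the two
    # results are concatenated, so the pointwise stage sees 2 * c_in channels.
    spatial_1d = 2 * h * w * k * c_in
    pointwise = h * w * (2 * c_in) * c_out
    return spatial_1d + pointwise


if __name__ == "__main__":
    shape = dict(h=14, w=14, k=3, c_in=64, c_out=64)
    for fn in (standard_conv_macs, depthwise_separable_macs,
               fuse_half_macs, fuse_full_macs):
        print(f"{fn.__name__:28s} {fn(**shape):>10,d}")
```

Note that MAC counts alone do not account for the reported speed-up: under these assumptions the full layout performs more MACs than the depthwise separable layer. The abstract attributes the gain to the 1D computation being systolic and therefore making better use of the array.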
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance [54.214426436283134]
Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications.
We present a strong yet efficient binary neural network for KWS, namely BiFSMNv2, pushing it to the real-network accuracy performance.
We highlight that, benefiting from the compact architecture and optimized hardware kernel, BiFSMNv2 can achieve an impressive 25.1x speedup and 20.2x storage-saving on edge hardware.
arXiv Detail & Related papers (2022-11-13T18:31:45Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- SVNet: Where SO(3) Equivariance Meets Binarization on Point Cloud Representation [65.4396959244269]
The paper tackles the challenge by designing a general framework to construct 3D learning architectures.
The proposed approach can be applied to general backbones like PointNet and DGCNN.
Experiments on ModelNet40, ShapeNet, and the real-world dataset ScanObjectNN demonstrated that the method achieves a great trade-off between efficiency, rotation, and accuracy.
arXiv Detail & Related papers (2022-09-13T12:12:19Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge [3.3767251810292955]
FuSeConv is a drop-in replacement for depthwise separable convolutions.
FuSeConv factorizes convolutions fully along their spatial and depth dimensions.
Neural Operator Scaffolding scaffolds the training of FuSeConv by distilling knowledge from depthwise separable convolutions.
arXiv Detail & Related papers (2021-08-25T19:22:25Z)
- HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT).
HANT replaces inefficient operations with more efficient alternatives using an approach similar to neural architecture search.
Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with a 0.4% drop in the top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z)
- S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks [5.417507302691321]
S2Engine transmits compressed data internally and allows each processing element to dynamically select aligned data from the compressed dataflow in convolution.
Compared to the naive systolic array, S2Engine achieves about $3.2\times$ and about $3.0\times$ improvements on speed and energy efficiency, respectively.
arXiv Detail & Related papers (2021-06-15T06:08:37Z)
- Hardware Architecture of Embedded Inference Accelerator and Analysis of Algorithms for Depthwise and Large-Kernel Convolutions [27.141754658998323]
The proposed architecture can support filter kernels with different sizes with high flexibility.
For image classification, the accuracy is increased by 1% by simply replacing $3 \times 3$ filters with $5 \times 5$ filters in depthwise convolutions.
arXiv Detail & Related papers (2021-04-29T05:45:16Z)
- VolumeNet: A Lightweight Parallel Network for Super-Resolution of Medical Volumetric Data [20.34783243852236]
We propose a 3D convolutional neural network (CNN) for SR of medical volumetric data called ParallelNet using parallel connections.
We show that the proposed VolumeNet significantly reduces the number of model parameters and achieves high precision results.
arXiv Detail & Related papers (2020-10-16T12:53:15Z)
- Depth-wise Decomposition for Accelerating Separable Convolutions in Efficient Convolutional Neural Networks [36.64158994999578]
Deep convolutional neural networks (CNNs) have been established as the primary methods for many computer vision tasks.
Recently, depth-wise separable convolution has been proposed for image recognition tasks on computationally limited platforms.
We propose a novel decomposition approach based on SVD, namely depth-wise decomposition, for expanding regular convolutions into depthwise separable convolutions.
arXiv Detail & Related papers (2019-10-21T15:37:53Z)