Related papers: FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic Arrays

FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic Arrays

URL: http://arxiv.org/abs/2105.13434v1
Date: Thu, 27 May 2021 20:19:39 GMT
Title: FuSeConv: Fully Separable Convolutions for Fast Inference on Systolic Arrays
Authors: Surya Selvam, Vinod Ganesan and Pratyush Kumar
Abstract summary: We propose FuSeConv as a drop-in replacement for depth-wise separable convolution. FuSeConv generalizes the decomposition of convolutions fully to separable 1D convolutions along spatial and depth dimensions. We achieve a significant speed-up of 3x-7x with the MobileNet family of networks on a systolic array of size 64x64, with comparable accuracy on the ImageNet dataset.
Score: 2.8583189395674653
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Both efficient neural networks and hardware accelerators are being explored to speed up DNN inference on edge devices. For example, MobileNet uses depthwise separable convolution to achieve much lower latency, while systolic arrays provide much higher performance per watt. Interestingly however, the combination of these two ideas is inefficient: The computational patterns of depth-wise separable convolution are not systolic and lack data reuse to saturate the systolic array's constrained dataflow. In this paper, we propose FuSeConv (Fully-Separable Convolution) as a drop-in replacement for depth-wise separable convolution. FuSeConv generalizes the decomposition of convolutions fully to separable 1D convolutions along spatial and depth dimensions. The resultant computation is systolic and efficiently utilizes the systolic array with a slightly modified dataflow. With FuSeConv, we achieve a significant speed-up of 3x-7x with the MobileNet family of networks on a systolic array of size 64x64, with comparable accuracy on the ImageNet dataset. The high speed-up motivates exploration of hardware-aware Neural Operator Search (NOS) in complement to ongoing efforts on Neural Architecture Search (NAS).

Related papers

TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture. To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer. In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance [54.214426436283134]
Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications. We present a strong yet efficient binary neural network for KWS, namely BiFSMNv2, pushing it to the real-network accuracy performance. We highlight that benefiting from the compact architecture and optimized hardware kernel, BiFSMNv2 can achieve an impressive 25.1x speedup and 20.2x storage-saving on edge hardware.
arXiv Detail & Related papers (2022-11-13T18:31:45Z)
Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel. Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU. Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
SVNet: Where SO(3) Equivariance Meets Binarization on Point Cloud Representation [65.4396959244269]
The paper tackles the challenge by designing a general framework to construct 3D learning architectures. The proposed approach can be applied to general backbones like PointNet and DGCNN. Experiments on ModelNet40, ShapeNet, and the real-world dataset ScanObjectNN, demonstrated that the method achieves a great trade-off between efficiency, rotation, and accuracy.
arXiv Detail & Related papers (2022-09-13T12:12:19Z)
DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slice a part of network parameters for inputs with diverse difficulty levels. We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
Design and Scaffolded Training of an Efficient DNN Operator for Computer Vision on the Edge [3.3767251810292955]
FuSeConv is a drop-in replacement for depthwise separable convolutions. FuSeConv factorizes convolution fully along their spatial and depth dimensions. Neural Operator Scaffolding scaffolds the training of FuSeConv by distilling knowledge from depthwise separable convolutions.
arXiv Detail & Related papers (2021-08-25T19:22:25Z)
HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT) HANT replaces inefficient operations with more efficient alternatives using a neural architecture search like approach. Our results on accelerating the EfficientNet family show that HANT can accelerate them by up to 3.6x with 0.4% drop in the top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z)
S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks [5.417507302691321]
S2Engine transmits compressed data internally and allows each processing element to dynamically select an aligned data from the compressed dataflow in convolution. Compared to the naive systolic array, S2Engine achieves about $3.2times$ and about $3.0times$ improvements on speed and energy efficiency, respectively.
arXiv Detail & Related papers (2021-06-15T06:08:37Z)
Hardware Architecture of Embedded Inference Accelerator and Analysis of Algorithms for Depthwise and Large-Kernel Convolutions [27.141754658998323]
The proposed architecture can support filter kernels with different sizes with high flexibility. For image classification, the accuracy is increased by 1% by simply replacing $3 times 3$ filters with $5 times 5$ filters in depthwise convolutions.
arXiv Detail & Related papers (2021-04-29T05:45:16Z)
VolumeNet: A Lightweight Parallel Network for Super-Resolution of Medical Volumetric Data [20.34783243852236]
We propose a 3D convolutional neural network (CNN) for SR of medical volumetric data called ParallelNet using parallel connections. We show that the proposed VolumeNet significantly reduces the number of model parameters and achieves high precision results.
arXiv Detail & Related papers (2020-10-16T12:53:15Z)
Depth-wise Decomposition for Accelerating Separable Convolutions in Efficient Convolutional Neural Networks [36.64158994999578]
Deep convolutional neural networks (CNNs) have been established as the primary methods for many computer vision tasks. Recently, depth-wise separable convolution has been proposed for image recognition tasks on computationally limited platforms. We propose a novel decomposition approach based on SVD, namely depth-wise decomposition, for expanding regular convolutions into depthwise separable convolutions.
arXiv Detail & Related papers (2019-10-21T15:37:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.