Phantom: A High-Performance Computational Core for Sparse Convolutional
Neural Networks
- URL: http://arxiv.org/abs/2111.05002v1
- Date: Tue, 9 Nov 2021 08:43:03 GMT
- Title: Phantom: A High-Performance Computational Core for Sparse Convolutional
Neural Networks
- Authors: Mahmood Azhar Qureshi, Arslan Munir
- Abstract summary: Sparse convolutional neural networks (CNNs) have gained significant traction over the past few years.
They can drastically decrease the model size and computations, if exploited befittingly, as compared to their dense counterparts.
Recently proposed sparse accelerators like SCNN, Eyeriss v2, and SparTen, actively exploit the two-sided or full sparsity, that is, sparsity in both weights and activations, for performance gains.
These accelerators either have an inefficient micro-architecture, which limits their performance, lack support for non-unit stride convolutions and fully-connected layers, or suffer massively from systematic load imbalance.
- Score: 3.198144010381572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse convolutional neural networks (CNNs) have gained significant traction
over the past few years as sparse CNNs can drastically decrease the model size
and computations, if exploited befittingly, as compared to their dense
counterparts. Sparse CNNs often introduce variations in the layer shapes and
sizes, which can prevent dense accelerators from performing well on sparse CNN
models. Recently proposed sparse accelerators like SCNN, Eyeriss v2, and
SparTen, actively exploit the two-sided or full sparsity, that is, sparsity in
both weights and activations, for performance gains. These accelerators,
however, either have inefficient micro-architecture, which limits their
performance, have no support for non-unit stride convolutions and
fully-connected (FC) layers, or suffer massively from systematic load
imbalance. To circumvent these issues and support both sparse and dense models,
we propose Phantom, a multi-threaded, dynamic, and flexible neural
computational core. Phantom uses sparse binary mask representation to actively
lookahead into sparse computations, and dynamically schedule its computational
threads to maximize the thread utilization and throughput. We also generate a
two-dimensional (2D) mesh architecture of Phantom neural computational cores,
which we refer to as Phantom-2D accelerator, and propose a novel dataflow that
supports all layers of a CNN, including unit and non-unit stride convolutions,
and FC layers. In addition, Phantom-2D uses a two-level load balancing strategy
to minimize the computational idling, thereby, further improving the hardware
utilization. To show support for different types of layers, we evaluate the
performance of the Phantom architecture on VGG16 and MobileNet. Our simulations
show that the Phantom-2D accelerator attains a performance gain of 12x, 4.1x,
1.98x, and 2.36x, over dense architectures, SCNN, SparTen, and Eyeriss v2,
respectively.
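The mask-driven lookahead and dynamic thread scheduling described in the abstract can be pictured with a small software model. The following is a minimal NumPy sketch, not Phantom's actual microarchitecture: it ANDs the binary sparsity masks of a weight vector and an activation vector to find the multiply-accumulates that actually need to happen, then deals that work round-robin across a fixed number of computational threads. The function name, thread count, and round-robin policy are illustrative assumptions.

```python
import numpy as np

def schedule_sparse_macs(weights, activations, num_threads=4):
    """Toy model of mask-based lookahead scheduling (illustrative only).

    Binary masks mark the non-zero weights and activations; their AND
    gives the positions where a real MAC is needed. Those MACs are then
    dealt round-robin across `num_threads` compute threads so no thread
    idles while work remains.
    """
    w_mask = weights != 0
    a_mask = activations != 0
    valid = np.flatnonzero(w_mask & a_mask)   # lookahead: only real work

    # Round-robin assignment of the valid MACs to the threads.
    per_thread = [valid[t::num_threads] for t in range(num_threads)]
    partial_sums = [np.dot(weights[idx], activations[idx]) for idx in per_thread]
    return sum(partial_sums), [len(idx) for idx in per_thread]

# Example: roughly 75% sparse weight and activation vectors.
rng = np.random.default_rng(0)
w = rng.standard_normal(64) * (rng.random(64) > 0.75)
a = rng.standard_normal(64) * (rng.random(64) > 0.75)
result, work_per_thread = schedule_sparse_macs(w, a)
print(result, work_per_thread)  # work is spread evenly across the 4 threads
```

Even in this toy model, two-sided sparsity leaves only a small fraction of positions with real work, and spreading that work evenly is what keeps the threads utilized, which is the effect the Phantom-2D load-balancing strategy targets in hardware.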
Related papers
- Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights Generation [13.681095158525514]
unzipFPGA is a novel CNN inference system that counteracts the limitations of existing CNN engines.
We introduce a weights generator module that enables the on-chip on-the-fly generation of weights.
We further enhance unzipFPGA with an automated hardware-aware methodology that tailors the weights generation mechanism to the target CNN-device pair.
arXiv Detail & Related papers (2023-07-25T11:19:21Z)
- InceptionNeXt: When Inception Meets ConvNeXt [147.50287103414115]
We build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance.
InceptionNeXt achieves 1.6x higher training throughput than ConvNeXt-T and attains a 0.2% top-1 accuracy improvement on ImageNet-1K.
arXiv Detail & Related papers (2023-03-29T17:59:58Z)
- EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators [12.223778147172107]
Dilated and transposed convolutions are widely used in modern convolutional neural networks (CNNs).
These kernels stress current compute systems due to their high memory intensity, exascale compute demands, and large energy consumption.
We propose EcoFlow, a new set of dataflows and mapping algorithms for dilated and transposed convolutions.
arXiv Detail & Related papers (2022-02-04T18:48:36Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and the dynamic slice-able network (DS-Net++), which adjust the filter numbers of CNNs and multiple dimensions in both CNNs and transformers in an input-dependent manner.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
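Dynamic weight slicing, as summarized in the entry above, activates only a leading slice of each layer's parameters, with the slice size chosen per input. The sketch below is a bare-bones model of that idea under assumptions of my own (a hard-coded variance-based gate on input "difficulty" and a plain dense layer); it is not the DS-Net/DS-Net++ architecture or its training scheme.

```python
import numpy as np

def sliced_linear(x, weight, bias, ratio):
    """Use only the first `ratio` fraction of output filters (illustrative).

    weight: (out_features, in_features); bias: (out_features,)
    Slicing the leading rows keeps memory access contiguous, which is what
    makes this form of dynamic inference hardware-friendly.
    """
    k = max(1, int(round(ratio * weight.shape[0])))
    return x @ weight[:k].T + bias[:k]

def difficulty_gate(x, easy_ratio=0.25, hard_ratio=1.0, thresh=1.0):
    """Toy gate: low-variance inputs are treated as 'easy' and get a thin slice."""
    return easy_ratio if x.var() < thresh else hard_ratio

rng = np.random.default_rng(0)
W, b = rng.standard_normal((64, 32)), rng.standard_normal(64)
x = rng.standard_normal(32) * 0.1           # low variance -> "easy" input
y = sliced_linear(x, W, b, difficulty_gate(x))
print(y.shape)                              # (16,): only a quarter of the filters ran
```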
- Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
arXiv Detail & Related papers (2021-06-30T03:54:35Z)
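As a rough illustration of the content-aware idea in the entry above, the following NumPy sketch applies a cheap 1x1 kernel on "smooth" windows and the full KxK kernel elsewhere. The variance threshold used to decide smoothness, the single-channel setting, and unit stride with no padding are my own simplifications; this is not the CAC layer from the paper, only a sketch of the dispatch logic.

```python
import numpy as np

def content_aware_conv2d(x, w_full, w_1x1, smooth_thresh=1e-2):
    """Toy single-channel content-aware convolution (illustrative only).

    x      : (H, W) input
    w_full : (K, K) full kernel
    w_1x1  : scalar 1x1 kernel used on smooth windows
    A window counts as smooth when its variance is below `smooth_thresh`;
    smooth windows get the cheap 1x1 path, the rest get the full KxK kernel.
    """
    K = w_full.shape[0]
    H, W = x.shape
    out = np.zeros((H - K + 1, W - K + 1))
    center = K // 2
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i:i + K, j:j + K]
            if window.var() < smooth_thresh:     # smooth region: 1x1 kernel
                out[i, j] = w_1x1 * window[center, center]
            else:                                # detailed region: full kernel
                out[i, j] = np.sum(window * w_full)
    return out
```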
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate models in resource-constrained environments.
In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
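N:M fine-grained structured sparsity keeps at most N non-zero weights in every group of M consecutive weights (2:4 being the common hardware-supported case). The magnitude-based pruning step below is a minimal sketch of that constraint, not the from-scratch training recipe of the paper above; the group size and the magnitude criterion are the usual conventions, assumed here for illustration.

```python
import numpy as np

def prune_n_of_m(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in every group of m (illustrative).

    weights: 1-D array whose length is a multiple of m.
    Returns a pruned copy satisfying the n:m structured-sparsity constraint.
    """
    w = weights.reshape(-1, m).copy()
    # Indices of the (m - n) smallest-magnitude weights in each group.
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(16)
w_sparse = prune_n_of_m(w, n=2, m=4)
assert all(np.count_nonzero(g) <= 2 for g in w_sparse.reshape(-1, 4))
```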
- GhostSR: Learning Ghost Features for Efficient Image Super-Resolution [49.393251361038025]
Single image super-resolution (SISR) systems based on convolutional neural networks (CNNs) achieve impressive performance but require huge computational costs.
We propose to use shift operation to generate the redundant features (i.e., Ghost features) of SISR models.
We show that both the non-compact and lightweight SISR models embedded in our proposed module can achieve comparable performance to that of their baselines.
arXiv Detail & Related papers (2021-01-21T10:09:47Z)
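The ghost-feature idea in the entry above replaces part of the convolution with a much cheaper operation; in GhostSR that operation is a spatial shift. The snippet below only sketches that substitution under my own simplifications (a fixed list of shift offsets and NumPy's roll in place of a learned shift); it is not the paper's trained module.

```python
import numpy as np

def ghost_features_by_shift(intrinsic, shifts=((0, 1), (1, 0), (0, -1), (-1, 0))):
    """Generate 'ghost' feature maps by shifting intrinsic ones (illustrative).

    intrinsic: (C, H, W) feature maps produced by a (smaller) convolution.
    Each ghost map is a spatially shifted copy of an intrinsic map, which
    costs no multiplications, unlike generating it with another convolution.
    """
    ghosts = []
    for c in range(intrinsic.shape[0]):
        dy, dx = shifts[c % len(shifts)]
        ghosts.append(np.roll(intrinsic[c], shift=(dy, dx), axis=(0, 1)))
    return np.concatenate([intrinsic, np.stack(ghosts)], axis=0)  # (2C, H, W)

feats = np.random.randn(8, 32, 32)
out = ghost_features_by_shift(feats)
assert out.shape == (16, 32, 32)
```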
- When deep learning models on GPU can be accelerated by taking advantage of unstructured sparsity [0.0]
This paper focuses on improving the efficiency of sparse convolutional neural network (CNN) layers on graphics processing units (GPUs).
Modern CNN models need megabytes of coefficients and millions of MAC operations to perform convolution.
We show when it is worth using a direct sparse operation to speed up the computation of convolution layers.
arXiv Detail & Related papers (2020-11-12T10:13:48Z)
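The trade-off described in the entry above, dense GEMM versus a direct sparse operation, can be sketched on the CPU with NumPy/SciPy. This is only an illustration of the arithmetic equivalence under assumed shapes and sparsity (im2col plus a CSR matrix-matrix product), not the GPU kernels studied in the paper.

```python
import numpy as np
from scipy.sparse import csr_matrix

def im2col(x, kh, kw):
    """Unfold (C, H, W) into (C*kh*kw, out_h*out_w) columns; unit stride, no padding."""
    C, H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols, oh, ow

# Pruned filters: roughly 90% of the coefficients are zero (assumed).
K, C, kh, kw = 16, 8, 3, 3
w = np.random.randn(K, C, kh, kw) * (np.random.rand(K, C, kh, kw) > 0.9)
x = np.random.randn(C, 16, 16)

w_mat = w.reshape(K, -1)              # dense filter matrix for the GEMM path
w_csr = csr_matrix(w_mat)             # CSR keeps only the non-zero coefficients
cols, oh, ow = im2col(x, kh, kw)

y_dense = (w_mat @ cols).reshape(K, oh, ow)   # dense GEMM convolution
y_sparse = (w_csr @ cols).reshape(K, oh, ow)  # direct sparse convolution
assert np.allclose(y_dense, y_sparse)         # same result, fewer multiplies when sparse
```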
- RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobiles.
arXiv Detail & Related papers (2020-07-20T02:05:32Z)
- Learning Sparse & Ternary Neural Networks with Entropy-Constrained Trained Ternarization (EC2T) [17.13246260883765]
Deep neural networks (DNNs) have shown remarkable success in a variety of machine learning applications.
In recent years, there is an increasing interest in deploying DNNs to resource-constrained devices with limited energy, memory, and computational budget.
We propose Entropy-Constrained Trained Ternarization (EC2T), a general framework to create sparse and ternary neural networks.
arXiv Detail & Related papers (2020-04-02T15:38:00Z)
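Sparse and ternary networks, as in the entry above, constrain every weight to the set {-a, 0, +a}. The snippet below is a generic threshold-based ternarization sketch (in the style of common ternary weight schemes), assumed here for illustration; it is not the entropy-constrained training procedure that EC2T itself proposes.

```python
import numpy as np

def ternarize(weights, delta_ratio=0.7):
    """Map weights to {-alpha, 0, +alpha} (generic sketch, not EC2T itself).

    delta_ratio scales the mean magnitude to get the zeroing threshold; the
    shared scale alpha is the mean magnitude of the surviving weights.
    """
    delta = delta_ratio * np.mean(np.abs(weights))
    mask = np.abs(weights) > delta                 # weights kept as +/- alpha
    alpha = np.abs(weights[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(weights) * mask, alpha

w = np.random.randn(256)
w_t, alpha = ternarize(w)
print(np.unique(w_t), alpha)   # three values: -alpha, 0, +alpha (sparse by construction)
```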
- Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs [6.035819238203187]
We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance.
We also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3x with cuDNN and above 10x with Arm Compute Library and TVM.
arXiv Detail & Related papers (2020-02-20T12:07:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.