S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN
Acceleration
- URL: http://arxiv.org/abs/2107.07983v1
- Date: Fri, 16 Jul 2021 15:57:06 GMT
- Title: S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN
Acceleration
- Authors: Zhi-Gang Liu, Paul N. Whatmough, Yuhao Zhu, Matthew Mattina
- Abstract summary: Exploiting sparsity is a key technique in accelerating quantized convolutional neural network (CNN) inference on mobile devices.
We propose to exploit structured sparsity, more specifically, Density Bound Block (DBB) sparsity for both weights and activations.
We describe S2TA, a systolic array-based CNN accelerator that exploits joint weight and activation DBB sparsity.
- Score: 21.110711058376534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploiting sparsity is a key technique in accelerating quantized
convolutional neural network (CNN) inference on mobile devices. Prior sparse
CNN accelerators largely exploit unstructured sparsity and achieve significant
speedups. Due to the unbounded, largely unpredictable sparsity patterns,
however, exploiting unstructured sparsity requires complicated hardware design
with significant energy and area overhead, which is particularly detrimental to
mobile/IoT inference scenarios where energy and area efficiency are crucial. We
propose to exploit structured sparsity, more specifically, Density Bound Block
(DBB) sparsity for both weights and activations. DBB block tensors bound the
maximum number of non-zeros per block. DBB thus exposes statically predictable
sparsity patterns that enable lean sparsity-exploiting hardware. We propose new
hardware primitives to implement DBB sparsity for (static) weights and
(dynamic) activations, respectively, with very low overheads. Building on top
of the primitives, we describe S2TA, a systolic array-based CNN accelerator
that exploits joint weight and activation DBB sparsity and new dimensions of
data reuse unavailable on the traditional systolic array. S2TA in 16nm achieves
more than 2x speedup and energy reduction compared to a strong baseline of a
systolic array with zero-value clock gating, over five popular CNN benchmarks.
Compared to two recent non-systolic sparse accelerators, Eyeriss v2 (65nm) and
SparTen (45nm), S2TA in 65nm uses about 2.2x and 3.1x less energy per
inference, respectively.
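
To make the DBB constraint concrete, here is a minimal NumPy sketch that keeps only a bounded number of non-zeros in each block of a tensor. The flattened 1D blocking, block size (8), and bound (4) are illustrative assumptions; the sketch does not reproduce S2TA's actual block shapes, INT8 quantization, or hardware encoding.

```python
# Minimal Density Bound Block (DBB) sketch: keep at most `max_nonzeros`
# largest-magnitude values per block of `block_size` elements.
# Block size and bound are illustrative, not S2TA's actual parameters.
import numpy as np

def enforce_dbb(x: np.ndarray, block_size: int = 8, max_nonzeros: int = 4) -> np.ndarray:
    flat = x.flatten().astype(np.float32)
    pad = (-flat.size) % block_size
    flat = np.pad(flat, (0, pad))                 # pad so blocks divide evenly
    blocks = flat.reshape(-1, block_size)
    # indices of the smallest-magnitude entries to drop in each block
    drop = np.argsort(np.abs(blocks), axis=1)[:, : block_size - max_nonzeros]
    np.put_along_axis(blocks, drop, 0.0, axis=1)
    return blocks.reshape(-1)[: x.size].reshape(x.shape)

w = np.random.randn(64, 64)
w_dbb = enforce_dbb(w)
# Every block now holds at most 4 non-zeros: a statically predictable pattern.
assert all((blk != 0).sum() <= 4 for blk in w_dbb.reshape(-1, 8))
```

Because every block satisfies the same bound by construction, the hardware can provision a fixed number of multipliers per block and skip the guaranteed zeros without the load-balancing and indexing machinery that unstructured sparsity requires.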
Related papers
- BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network [55.21288428359509]
Existing 3D occupancy networks demand significant hardware resources, hindering their deployment on edge devices.
We propose a novel binarized deep convolution (BDC) unit that effectively enhances performance while increasing the number of binarized convolutional layers.
Our BDC-Occ model is created by applying the proposed BDC unit to binarize the existing 3D occupancy networks.
arXiv Detail & Related papers (2024-05-27T10:44:05Z) - Signed Binary Weight Networks [17.07866119979333]
Two important algorithmic techniques have shown promise for enabling efficient inference: sparsity and binarization.
We propose a new method called signed-binary networks to improve efficiency further.
Our method achieves accuracy comparable to binary networks on the ImageNet and CIFAR10 datasets and can lead to 69% sparsity.
arXiv Detail & Related papers (2022-11-25T00:19:21Z) - Two Sparsities Are Better Than One: Unlocking the Performance Benefits
of Sparse-Sparse Networks [0.0]
We introduce Complementary Sparsity, a technique that significantly improves the performance of dual sparse networks on existing hardware.
We show up to 100X improvement in throughput and energy efficiency when performing inference on FPGAs.
Our results suggest that weight plus activation sparsity can be a potent combination for efficiently scaling future AI models.
arXiv Detail & Related papers (2021-12-27T20:41:01Z) - DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and
Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices part of the network parameters for inputs of varying difficulty.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++), which adjust the number of filters in CNNs and multiple dimensions in both CNNs and transformers in an input-dependent manner.
arXiv Detail & Related papers (2021-09-21T09:57:21Z) - Dynamic Slimmable Network [105.74546828182834]
We develop a dynamic network slimming regime named Dynamic Slimmable Network (DS-Net).
Our DS-Net is empowered with the ability of dynamic inference by the proposed double-headed dynamic gate.
It consistently outperforms its static counterparts as well as state-of-the-art static and dynamic model compression methods.
arXiv Detail & Related papers (2021-03-24T15:25:20Z) - Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training from scratch an N:M fine-grained structured sparse network.
arXiv Detail & Related papers (2021-02-08T05:55:47Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and
Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to reliance on external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reductions in storage and training energy, respectively, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z) - Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z) - Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator
for Mobile CNN Inference [16.812184391068786]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration.
A systolic array (SA) is a pipelined 2D array of processing elements (PEs).
We describe two significant improvements to the traditional SA architecture, to specifically optimize for CNN inference.
arXiv Detail & Related papers (2020-05-16T20:47:56Z) - SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost
Computation [97.78417228445883]
We present SmartExchange, an algorithm-hardware co-design framework for energy-efficient inference of deep neural networks (DNNs).
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all powers of two (see the sketch after this list).
We further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance.
arXiv Detail & Related papers (2020-05-07T12:12:49Z)
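
Related to the SmartExchange entry above, the following toy NumPy sketch reconstructs a layer weight matrix from a sparse coefficient matrix with power-of-two non-zeros and a small dense basis matrix. The shapes, sparsity level, factor ordering, and random values are illustrative assumptions, not the paper's training procedure or accelerator design.

```python
# Toy SmartExchange-style weight structure: W is stored as a sparse
# coefficient matrix Ce (non-zeros are signed powers of two, i.e. shifts)
# times a small dense basis matrix B. All shapes/values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
out_ch, in_ch, basis_dim = 64, 64, 8          # hypothetical layer sizes

B = rng.standard_normal((basis_dim, in_ch)).astype(np.float32)   # small basis

signs = rng.choice([-1.0, 1.0], size=(out_ch, basis_dim))
exponents = rng.integers(-3, 1, size=(out_ch, basis_dim))        # 2^-3 .. 2^0
Ce = signs * np.exp2(exponents)
Ce[rng.random((out_ch, basis_dim)) < 0.5] = 0.0                  # make it sparse

W = (Ce @ B).astype(np.float32)   # weights reconstructed at inference time
print("W:", W.shape, "| non-zero coefficients:", int(np.count_nonzero(Ce)))
```

Only the small basis and the non-zero coefficients need to be stored, and each multiply by a power-of-two coefficient can be realized as a shift, which is how the framework trades memory storage/access for lower-cost computation.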