Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator
for Mobile CNN Inference
- URL: http://arxiv.org/abs/2005.08098v1
- Date: Sat, 16 May 2020 20:47:56 GMT
- Authors: Zhi-Gang Liu, Paul N. Whatmough, Matthew Mattina
- Abstract summary: Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration.
The systolic array (SA) is a pipelined 2D array of processing elements (PEs).
We describe two significant improvements to the traditional SA architecture, to specifically optimize for CNN inference.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural network (CNN) inference on mobile devices demands
efficient hardware acceleration of low-precision (INT8) general matrix
multiplication (GEMM). The systolic array (SA) is a pipelined 2D array of
processing elements (PEs), with very efficient local data movement, well suited
to accelerating GEMM, and widely deployed in industry. In this work, we
describe two significant improvements to the traditional SA architecture, to
specifically optimize for CNN inference. Firstly, we generalize the traditional
scalar PE, into a Tensor-PE, which gives rise to a family of new Systolic
Tensor Array (STA) microarchitectures. The STA family increases intra-PE
operand reuse and datapath efficiency, resulting in circuit area and power
dissipation reduction of as much as 2.08x and 1.36x respectively, compared to
the conventional SA at iso-throughput with INT8 operands. Secondly, we extend
this design to support a novel block-sparse data format called density-bound
block (DBB). This variant (STA-DBB) achieves a 3.14x and 1.97x improvement over
the SA baseline at iso-throughput in area and power respectively, when
processing specially-trained DBB-sparse models, while remaining fully backwards
compatible with dense models.
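The density-bound block (DBB) format described above bounds the number of nonzero values in each fixed-size block of a tensor, which gives the hardware a predictable worst-case workload per block. As a minimal sketch of the idea, the snippet below checks and enforces a DBB constraint on a 1D weight vector; the block size of 8 and bound of 4 nonzeros are illustrative assumptions, not the paper's exact parameters, and the magnitude-based pruning is a generic stand-in for the specialized DBB training the paper relies on.

```python
import numpy as np

def is_dbb_sparse(weights, block_size=8, max_nonzeros=4):
    """Return True if every contiguous block of `block_size` elements
    contains at most `max_nonzeros` nonzero values (the DBB constraint).
    Block size and bound are illustrative, not the paper's parameters."""
    padded = np.pad(weights, (0, (-len(weights)) % block_size))
    blocks = padded.reshape(-1, block_size)
    return bool(np.all(np.count_nonzero(blocks, axis=1) <= max_nonzeros))

def prune_to_dbb(weights, block_size=8, max_nonzeros=4):
    """Zero the smallest-magnitude values in each block until the
    DBB bound holds (a crude proxy for DBB-aware training)."""
    out = np.asarray(weights, dtype=np.float64).copy()
    pad = (-len(out)) % block_size
    blocks = np.pad(out, (0, pad)).reshape(-1, block_size)
    for blk in blocks:
        if np.count_nonzero(blk) > max_nonzeros:
            # Indices sorted by magnitude, smallest first.
            order = np.argsort(np.abs(blk))
            blk[order[: block_size - max_nonzeros]] = 0.0
    return blocks.reshape(-1)[: len(out)]
```

Because every block carries at most `max_nonzeros` values, a DBB-aware datapath can provision exactly that many multipliers per block and still guarantee full utilization, which is the load-balancing advantage structured sparsity has over unstructured pruning.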
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part II: Architecture and Hardware Implementation [0.0]
TrIM is an innovative dataflow based on a triangular movement of inputs.
TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.
The architecture achieves a peak throughput of 453.6 Giga Operations per Second.
arXiv Detail & Related papers (2024-08-05T10:18:00Z)
- BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network [55.21288428359509]
Existing 3D occupancy networks demand significant hardware resources, hindering the deployment of edge devices.
We propose a novel binarized deep convolution (BDC) unit that effectively enhances performance while increasing the number of binarized convolutional layers.
Our BDC-Occ model is created by applying the proposed BDC unit to binarize the existing 3D occupancy networks.
arXiv Detail & Related papers (2024-05-27T10:44:05Z)
- Pruning for Improved ADC Efficiency in Crossbar-based Analog In-memory Accelerators [9.169425049927554]
Crossbar-based analog in-memory architectures are attractive for accelerating deep neural networks (DNNs), but they require analog-to-digital converters (ADCs) to communicate crossbar outputs.
ADCs consume a significant portion of energy and area of every crossbar processing unit.
We motivate crossbar-attuned pruning to target ADC-specific inefficiencies.
arXiv Detail & Related papers (2024-03-19T18:26:45Z)
- Point Transformer V3: Simpler, Faster, Stronger [88.80496333515325]
This paper focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing.
We present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms.
PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios.
arXiv Detail & Related papers (2023-12-15T18:59:59Z)
- BiFSMNv2: Pushing Binary Neural Networks for Keyword Spotting to Real-Network Performance [54.214426436283134]
Deep neural networks, such as the Deep-FSMN, have been widely studied for keyword spotting (KWS) applications.
We present a strong yet efficient binary neural network for KWS, namely BiFSMNv2, pushing it to the real-network accuracy performance.
We highlight that benefiting from the compact architecture and optimized hardware kernel, BiFSMNv2 can achieve an impressive 25.1x speedup and 20.2x storage-saving on edge hardware.
arXiv Detail & Related papers (2022-11-13T18:31:45Z)
- BiFSMN: Binary Neural Network for Keyword Spotting [47.46397208920726]
BiFSMN is an accurate and extreme-efficient binary neural network for KWS.
We show that BiFSMN can achieve an impressive 22.3x speedup and 15.5x storage-saving on real-world edge hardware.
arXiv Detail & Related papers (2022-02-14T05:16:53Z)
- S2TA: Exploiting Structured Sparsity for Energy-Efficient Mobile CNN Acceleration [21.110711058376534]
Exploiting sparsity is a key technique in accelerating quantized convolutional neural network (CNN) inference on mobile devices.
We propose to exploit structured sparsity, more specifically, Density Bound Block (DBB) sparsity for both weights and activations.
We describe S2TA, a systolic array-based CNN accelerator that exploits joint weight and activation DBB sparsity.
arXiv Detail & Related papers (2021-07-16T15:57:06Z)
- Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z)
- Binary DAD-Net: Binarized Driveable Area Detection Network for Autonomous Driving [94.40107679615618]
This paper proposes a novel binarized driveable area detection network (binary DAD-Net)
It uses only binary weights and activations in the encoder, the bottleneck, and the decoder part.
It outperforms state-of-the-art semantic segmentation networks on public datasets.
arXiv Detail & Related papers (2020-06-15T07:09:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.