Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration
- URL: http://arxiv.org/abs/2009.02381v2
- Date: Mon, 12 Oct 2020 21:43:36 GMT
- Title: Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration
- Authors: Zhi-Gang Liu, Paul N. Whatmough, and Matthew Mattina
- Abstract summary: Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
- Score: 14.958793135751149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural network (CNN) inference on mobile devices demands
efficient hardware acceleration of low-precision (INT8) general matrix
multiplication (GEMM). Exploiting data sparsity is a common approach to further
accelerate GEMM for CNN inference, and in particular, structural sparsity has
the advantages of predictable load balancing and very low index overhead. In
this paper, we address a key architectural challenge with structural sparsity:
how to provide support for a range of sparsity levels while maintaining high
utilization of the hardware. We describe a time unrolled formulation of
variable density-bound block (VDBB) sparsity that allows for a configurable
number of non-zero elements per block, at constant utilization. We then
describe a systolic array microarchitecture that implements this scheme, with
two data reuse optimizations. Firstly, we increase reuse in both operands and
partial products by increasing the number of MACs per PE. Secondly, we
introduce a novel approach of moving the IM2COL transform into the hardware,
which allows us to achieve a 3x data bandwidth expansion just before the
operands are consumed by the datapath, reducing the SRAM power consumption. The
optimizations for weight sparsity, activation sparsity and data reuse are all
interrelated, so the optimal combination is not obvious; we therefore perform
a design space evaluation to find the Pareto-optimal design characteristics.
The resulting design achieves 16.8 TOPS/W in 16nm with a modest 50% model
sparsity, and scales with model sparsity up to 55.7 TOPS/W at 87.5%. As
well as successfully demonstrating the variable DBB technique, this result
significantly outperforms previously reported sparse CNN accelerators.
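The abstract's two datapath ideas are easy to state in software terms. Below is a minimal NumPy sketch (not the paper's hardware) of (a) the IM2COL lowering that turns convolution into GEMM, which the design moves into the accelerator so that the 3x bandwidth expansion happens just before the datapath, and (b) density-bound block (DBB) pruning, where each block of B weights keeps at most N non-zeros and, in the variable (VDBB) scheme, N is configurable. The flat 1-D blocking and the function names are illustrative assumptions.

```python
import numpy as np

def im2col(x, k, stride=1):
    """Lower k x k convolution to GEMM: each output row holds one
    unrolled k*k*C input patch (the transform the paper moves into
    hardware, after the SRAM read)."""
    H, W, C = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((out_h * out_w, k * k * C), dtype=x.dtype)
    r = 0
    for i in range(0, H - k + 1, stride):
        for j in range(0, W - k + 1, stride):
            cols[r] = x[i:i + k, j:j + k, :].ravel()
            r += 1
    return cols

def vdbb_prune(w, block=8, nnz=2):
    """Density-bound block pruning: in every block of `block`
    consecutive weights, zero all but the `nnz` largest-magnitude
    values. A variable density bound means `nnz` is configurable
    (e.g., per layer) rather than fixed by the hardware."""
    assert w.size % block == 0
    flat = w.copy().reshape(-1, block)
    # indices of the (block - nnz) smallest-magnitude entries per block
    drop = np.argsort(np.abs(flat), axis=1)[:, : block - nnz]
    np.put_along_axis(flat, drop, 0.0, axis=1)
    return flat.reshape(w.shape)
```

For example, with block=8, nnz=4 corresponds to the 50% sparsity operating point quoted above, and nnz=1 to the 87.5% point.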
Related papers
- TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part II: Architecture and Hardware Implementation [0.0]
TrIM is an innovative dataflow based on a triangular movement of inputs.
TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.
The architecture achieves a peak throughput of 453.6 Giga Operations per Second.
arXiv Detail & Related papers (2024-08-05T10:18:00Z)
- Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators [0.0]
Deep Neural Networks (DNNs) are being developed, trained, and deployed at scale, straining both high-end and resource-limited devices.
Our solution is to implement weight block sparsity, a structured form of sparsity that is hardware-friendly.
We will present performance estimates using accurate and complete code generation for AIE2 configuration sets (AMD Versal FPGAs) with Resnet50, Inception V3, and VGG16.
arXiv Detail & Related papers (2024-07-12T17:37:49Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
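As a rough illustration of idea (ii), the Dense-and-Sparse decomposition can be sketched as splitting off the rare large-magnitude values into a sparse matrix kept at full precision, leaving a narrow-range dense part that quantizes well at very low bit-widths. The percentile threshold and CSR storage below are assumptions, not the paper's exact recipe.

```python
import numpy as np
from scipy import sparse

def dense_sparse_split(w, outlier_pct=0.5):
    """Split weights into a dense part (outliers zeroed, so its value
    range is narrow and quantizes well at ~3 bits) and a sparse CSR
    matrix holding the rare outliers at full precision."""
    thresh = np.percentile(np.abs(w), 100.0 - outlier_pct)
    mask = np.abs(w) > thresh
    dense = np.where(mask, 0.0, w)            # low-bit quantize this part
    outliers = sparse.csr_matrix(np.where(mask, w, 0.0))
    return dense, outliers

# Inference recombines both paths: y = W_dense_quantized @ x + W_outliers @ x
```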
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
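A butterfly matrix is a product of log2(n) sparse factors in which factor s only mixes index pairs that differ by 2^s, so a matrix-vector product costs O(n log n) instead of O(n^2). A small sketch of that structure follows; the factor layout is an assumption, and the paper's accelerator details are not modeled.

```python
import numpy as np

def butterfly_apply(factors, x):
    """Apply a butterfly matrix, stored as log2(n) sparse factors of
    2x2 mixing blocks, to a vector in O(n log n) work."""
    n = x.shape[0]
    assert n & (n - 1) == 0 and len(factors) == n.bit_length() - 1
    y = x.astype(np.float64).copy()
    for s, F in enumerate(factors):        # F has shape (n // 2, 2, 2)
        stride, b = 2 ** s, 0
        for start in range(0, n, 2 * stride):
            for off in range(stride):
                i, j = start + off, start + off + stride
                a, c = y[i], y[j]
                y[i] = F[b, 0, 0] * a + F[b, 0, 1] * c
                y[j] = F[b, 1, 0] * a + F[b, 1, 1] * c
                b += 1
    return y

# n = 8 needs log2(8) = 3 factors of n/2 = 4 blocks each:
# factors = [np.random.randn(4, 2, 2) for _ in range(3)]
```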
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
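The "block-wise" part is the key trick: quantizing each small block of state with its own absmax scale confines the effect of an outlier to one block. A simplified linear-quantization sketch follows; the paper actually uses a non-linear dynamic quantization map, which this omits.

```python
import numpy as np

def blockwise_quantize(state, block=2048):
    """Quantize 32-bit optimizer state to int8 with one absmax scale
    per block, so a single outlier only degrades its own block."""
    flat = state.ravel().astype(np.float32)
    flat = np.pad(flat, (0, (-flat.size) % block))  # pad to a whole block
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.round(blocks / scales * 127.0).astype(np.int8)
    return q, scales

def blockwise_dequantize(q, scales, shape):
    blocks = q.astype(np.float32) / 127.0 * scales
    return blocks.ravel()[: int(np.prod(shape))].reshape(shape)
```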
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices part of the network parameters for inputs of diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and the dynamic slice-able network (DS-Net++), which input-dependently adjust the number of filters in CNNs and multiple dimensions in both CNNs and transformers.
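In sketch form, weight slicing just indexes a contiguous prefix of filters, which keeps the dense-GEMM hardware path intact; the per-input gating that picks the ratio is the learned part (the gate below is a placeholder assumption, not the paper's module).

```python
import numpy as np

def sliced_conv_weights(w, ratio):
    """Dynamic weight slicing: keep only the first ceil(ratio * C_out)
    filters of a conv weight tensor (C_out, C_in, kH, kW). Easy inputs
    run a thin slice; hard inputs a wider one. Contiguous slicing keeps
    the memory access pattern hardware-friendly."""
    keep = max(1, int(np.ceil(ratio * w.shape[0])))
    return w[:keep]

# A (hypothetical) gate maps input features to a slice ratio:
# ratio = gate_network(x)   # e.g., one of {0.25, 0.5, 0.75, 1.0}
```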
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Adaptive Filters and Aggregator Fusion for Efficient Graph Convolutions [11.769185588579488]
We present state-of-the-art performance with lower memory consumption and latency, along with characteristics suited to accelerator implementation.
Our proposal uses memory proportional to the number of vertices in the graph, in contrast to competing methods which require memory proportional to the number of edges.
We propose aggregator fusion, a technique that enables GNNs to significantly boost their representational power, at the cost of only a 19% latency increase over standard sparse matrix multiplication.
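The memory claim can be illustrated with a streaming aggregation: accumulate neighbour features directly into a per-vertex buffer while scanning edges, so no per-edge message tensor is ever materialized. This is only a sketch of the vertex-proportional-memory property, not the paper's aggregator-fusion mechanism.

```python
import numpy as np

def streaming_mean_aggregate(edges, h):
    """Mean-aggregate neighbour features using O(V * d) memory:
    messages are consumed as edges stream past, instead of building
    an O(E * d) message tensor first."""
    V, d = h.shape
    acc = np.zeros((V, d), dtype=h.dtype)
    deg = np.zeros(V, dtype=np.int64)
    for src, dst in edges:          # edges: iterable of (src, dst) pairs
        acc[dst] += h[src]
        deg[dst] += 1
    return acc / np.maximum(deg, 1)[:, None]
```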
arXiv Detail & Related papers (2021-04-03T20:54:36Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference [16.812184391068786]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration.
A systolic array (SA) is a pipelined 2D array of processing elements (PEs).
We describe two significant improvements to the traditional SA architecture, to specifically optimize for CNN inference.
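The baseline these papers build on, a plain output-stationary systolic array, can be modeled as a wavefront schedule: at cycle t, PE (i, j) consumes the operands whose reduction index is k = t - i - j, because inputs are skewed by one cycle per hop as they propagate across the array. A toy cycle-enumeration model of this (not the tensor-PE design itself):

```python
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary systolic GEMM model: A streams from the left,
    B from the top, each PE (i, j) owns output C[i, j] and fires when
    the skewed wavefront delivers a matching operand pair."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for t in range(M + N + K - 2):          # total pipeline cycles
        for i in range(M):
            for j in range(N):
                k = t - i - j               # skew: one cycle per hop
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C
```

For any shapes, np.allclose(systolic_matmul(A, B), A @ B) holds; the point of the model is the cycle count M + N + K - 2 and the per-PE firing pattern.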
arXiv Detail & Related papers (2020-05-16T20:47:56Z)
- PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices [35.90103072918056]
Deep neural networks (DNNs) have emerged as the most important and popular artificial intelligence (AI) technique.
The growth of model size poses a key energy efficiency challenge for the underlying computing platform.
This paper proposes PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models.
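A permuted diagonal block stores only p values (one per row) on a circularly shifted diagonal, so a p x p block needs p weights plus one shift instead of p^2 weights, and its matrix-vector product reduces to a roll and an elementwise multiply. The shift convention below is my assumption about the structure, not the paper's exact encoding.

```python
import numpy as np

def permdiag_matvec(diags, shifts, x, p):
    """y = W @ x where W is built from p x p permuted-diagonal blocks.
    Block (i, j) has its row-k non-zero at column (k + shifts[i, j]) % p,
    holding value diags[i, j, k]."""
    n_row, n_col = shifts.shape
    y = np.zeros(n_row * p, dtype=x.dtype)
    for i in range(n_row):
        for j in range(n_col):
            xj = x[j * p:(j + 1) * p]
            # (block @ xj)[k] = diags[i, j, k] * xj[(k + shift) % p]
            y[i * p:(i + 1) * p] += diags[i, j] * np.roll(xj, -shifts[i, j])
    return y
    # storage per block: p values + 1 shift, vs p * p for a dense block
```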
arXiv Detail & Related papers (2020-04-23T02:26:40Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
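In code terms, pattern-based pruning constrains every 3x3 kernel to one of a small set of fixed masks (the fine-grained patterns), while whole-kernel removal supplies the coarse-grained structure; a small pattern set keeps execution regular enough for the compiler to specialize code per pattern. The magnitude-based pattern selection below is a common heuristic, assumed here rather than taken from the paper.

```python
import numpy as np

def pattern_prune(kernels, patterns):
    """Assign each 3x3 kernel the 0/1 mask from `patterns` that keeps
    the most absolute weight mass, then apply it."""
    flat = kernels.reshape(-1, 9)             # (num_kernels, 9)
    masks = patterns.reshape(-1, 9).astype(flat.dtype)
    scores = np.abs(flat) @ masks.T           # retained |w| per pattern
    best = scores.argmax(axis=1)              # best pattern per kernel
    return (flat * masks[best]).reshape(kernels.shape)

# e.g., patterns = a handful of 3x3 masks, each keeping 4 positions
```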
arXiv Detail & Related papers (2020-01-01T04:52:07Z)