Sparse Periodic Systolic Dataflow for Lowering Latency and Power
Dissipation of Convolutional Neural Network Accelerators
- URL: http://arxiv.org/abs/2207.00068v1
- Date: Thu, 30 Jun 2022 19:16:46 GMT
- Title: Sparse Periodic Systolic Dataflow for Lowering Latency and Power
Dissipation of Convolutional Neural Network Accelerators
- Authors: Jung Hwan Heo, Arash Fayyazi, Amirhossein Esmaili, Massoud Pedram
- Abstract summary: This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks.
By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and activations.
- Score: 3.043665249713003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces the sparse periodic systolic (SPS) dataflow, which
advances the state-of-the-art hardware accelerator for supporting lightweight
neural networks. Specifically, the SPS dataflow enables a novel hardware design
approach unlocked by an emergent pruning scheme, periodic pattern-based
sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware
compiler optimally reorders the weights and uses a simple indexing unit in
hardware to create matches between the weights and activations. Through the
compiler-hardware codesign, SPS dataflow enjoys higher degrees of parallelism
while being free of the high indexing overhead and without model accuracy loss.
Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and
accompanying neural network compiler outperform prior work in convolutional
neural network (CNN) accelerator designs targeting FPGA devices. Against other
sparsity-supporting weight storage formats, SPS results in 4.49x energy
efficiency gain while lowering storage requirements by 3.67x for total weight
storage (non-pruned weights plus indexing) and 22,044x for indexing memory.
Related papers
- TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part II: Architecture and Hardware Implementation [0.0]
TrIM is an innovative dataflow based on a triangular movement of inputs.
TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.
architecture achieves a peak throughput of 453.6 Giga Operations per Second.
arXiv Detail & Related papers (2024-08-05T10:18:00Z) - Spiker+: a framework for the generation of efficient Spiking Neural
Networks FPGA accelerators for inference at the edge [49.42371633618761]
Spiker+ is a framework for generating efficient, low-power, and low-area customized Spiking Neural Networks (SNN) accelerators on FPGA for inference at the edge.
Spiker+ is tested on two benchmark datasets, the MNIST and the Spiking Heidelberg Digits (SHD)
arXiv Detail & Related papers (2024-01-02T10:42:42Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on lowpower embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - NPS: A Framework for Accurate Program Sampling Using Graph Neural
Network [23.021249354193305]
This paper introduces Neural Program Sampling (NPS), a novel framework that learns execution embeddings using dynamic snapshots of a Graph Neural Network.
AssemblyNet serves as NPS's graph model and neural architecture, capturing a program's behavior in aspects such as data computation, code path, and data flow.
NPS shows higher accuracy and generality than the state-of-the-art GNN approach in code behavior learning, enabling the generation of high-quality execution embeddings.
arXiv Detail & Related papers (2023-04-18T10:13:28Z) - FireFly: A High-Throughput Hardware Accelerator for Spiking Neural
Networks with Efficient DSP and Memory Optimization [6.966706170499345]
Spiking neural networks (SNNs) have been widely used due to their strong biological interpretability and high energy efficiency.
Most SNN hardware implementations for field-programmable gate arrays (FPGAs) cannot meet arithmetic or memory efficiency requirements.
We propose an FPGA accelerator that can process spikes generated by the firing neuron on-the-fly (FireFly)
arXiv Detail & Related papers (2023-01-05T04:28:07Z) - Lightweight and Progressively-Scalable Networks for Semantic
Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net) that novelly expands the network complexity in a greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z) - Efficient Hardware Acceleration of Sparsely Active Convolutional Spiking
Neural Networks [0.0]
Spiking Neural Networks (SNNs) compute in an event-based matter to achieve a more efficient computation than standard Neural Networks.
We propose a novel architecture that is optimized for the processing of Convolutional SNNs that feature a high degree of activation sparsity.
arXiv Detail & Related papers (2022-03-23T14:18:58Z) - FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around the 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on the accuracy, if compared to its software, full precision counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio
Sparsification [3.3711251611130337]
A hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented.
Results show that the proposed accelerator could provide up to 272% higher effective GOPS/W and the perplexity error is reduced by up to 1.4% for the PTB dataset.
arXiv Detail & Related papers (2021-01-07T18:23:48Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm- hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.