Balancing Efficiency and Flexibility for DNN Acceleration via Temporal
GPU-Systolic Array Integration
- URL: http://arxiv.org/abs/2002.08326v2
- Date: Wed, 10 Jun 2020 10:27:55 GMT
- Title: Balancing Efficiency and Flexibility for DNN Acceleration via Temporal
GPU-Systolic Array Integration
- Authors: Cong Guo, Yangjie Zhou, Jingwen Leng, Yuhao Zhu, Zidong Du, Quan Chen,
Chao Li, Bin Yao and Minyi Guo
- Abstract summary: We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture design and execution model.
SMA offers general-purpose programmability on DNN accelerators in order to accelerate end-to-end applications.
The SMA achieves up to 63% performance improvement while consuming 23% less energy than the baseline Volta architecture with TensorCore.
- Score: 22.90145417561172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research interest in specialized hardware accelerators for deep
neural networks (DNNs) has spiked recently owing to their superior performance
and efficiency. However, today's DNN accelerators primarily focus on accelerating
specific "kernels" such as convolution and matrix multiplication, which are
vital but only part of an end-to-end DNN-enabled application. Meaningful
speedups over the entire application often require supporting computations that
are, while massively parallel, ill-suited to DNN accelerators. Integrating a
general-purpose processor such as a CPU or a GPU incurs significant data
movement overhead and leads to resource under-utilization on the DNN
accelerators.
We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture
design and execution model that offers general-purpose programmability on DNN
accelerators in order to accelerate end-to-end applications. The key to SMA is
the temporal integration of the systolic execution model with the GPU-like SIMD
execution model. The SMA exploits the common components shared between the
systolic-array accelerator and the GPU, and provides lightweight
reconfiguration capability to switch between the two modes in-situ. The SMA
achieves up to 63% performance improvement while consuming 23% less energy than
the baseline Volta architecture with TensorCore.
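As a concrete illustration of this temporal multi-mode idea, below is a minimal Python sketch, not the authors' implementation: one pool of processing elements runs an output-stationary systolic matrix multiply for the DNN kernel, then switches in time to a GPU-like SIMD mode for the surrounding general-purpose computation. All class and method names are invented for the example.

```python
import numpy as np

class MultiModeArray:
    """Toy model of the SMA idea: one set of MACs, two execution modes."""

    def __init__(self, size=4):
        self.size = size      # size x size processing elements
        self.mode = "simd"    # current execution mode

    def reconfigure(self, mode):
        # In SMA this is a lightweight in-situ reconfiguration;
        # here it is just a flag flip.
        assert mode in ("systolic", "simd")
        self.mode = mode

    def systolic_matmul(self, a, b):
        # Output-stationary systolic dataflow: each PE (i, j) accumulates
        # a[i, k] * b[k, j] as operands stream through, one k per step.
        acc = np.zeros((self.size, self.size))
        for k in range(a.shape[1]):   # one systolic wavefront per step
            acc += np.outer(a[:, k], b[k, :])
        return acc

    def simd_map(self, fn, x):
        # GPU-like SIMD mode: the same op on every lane.
        return fn(x)

# A DNN kernel in systolic mode, then a non-GEMM stage in SIMD mode,
# without the data ever leaving the (simulated) accelerator.
arr = MultiModeArray(size=4)
a, b = np.random.rand(4, 8), np.random.rand(8, 4)

arr.reconfigure("systolic")
y = arr.systolic_matmul(a, b)

arr.reconfigure("simd")
y = arr.simd_map(lambda t: np.maximum(t, 0.0), y)    # e.g. a ReLU stage
print(np.allclose(y, np.maximum(a @ b, 0.0)))        # True
```

The point of the sketch is the temporal sharing: the same array serves both the GEMM kernel and the non-GEMM stage, which is what avoids the data-movement overhead of bouncing between a discrete accelerator and a host processor.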
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Spiker+: a framework for the generation of efficient Spiking Neural Networks FPGA accelerators for inference at the edge [49.42371633618761]
Spiker+ is a framework for generating efficient, low-power, low-area, customized Spiking Neural Network (SNN) accelerators on FPGAs for inference at the edge.
Spiker+ is tested on two benchmark datasets: MNIST and the Spiking Heidelberg Digits (SHD).
arXiv Detail & Related papers (2024-01-02T10:42:42Z)
- FireFly: A High-Throughput Hardware Accelerator for Spiking Neural Networks with Efficient DSP and Memory Optimization [6.966706170499345]
Spiking neural networks (SNNs) have been widely used due to their strong biological interpretability and high energy efficiency.
Most SNN hardware implementations for field-programmable gate arrays (FPGAs) cannot meet arithmetic or memory efficiency requirements.
We propose FireFly, an FPGA accelerator that processes spikes generated by firing neurons on the fly.
arXiv Detail & Related papers (2023-01-05T04:28:07Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
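For reference, the basic snnTorch usage pattern looks like the minimal sketch below; it shows the package's standard leaky integrate-and-fire API, not the IPU-specific release described in the paper.

```python
import torch
import snntorch as snn

# One leaky integrate-and-fire (LIF) neuron; beta is the membrane decay
# per timestep.
lif = snn.Leaky(beta=0.9)
mem = lif.init_leaky()                 # initialize membrane potential

inputs = torch.rand(100, 1) * 0.5      # 100 timesteps of input current
spikes = []
for step in range(inputs.shape[0]):
    spk, mem = lif(inputs[step], mem)  # spike when the membrane crosses threshold
    spikes.append(spk)

print(torch.stack(spikes).sum().item(), "spikes over 100 steps")
```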
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference [0.0]
We propose SECDA, a new hardware/software co-design methodology to reduce design time of optimized Deep Neural Networks (DNN) inference accelerators on edge devices with FPGAs.
We use SECDA to efficiently develop two different DNN accelerator designs on a PYNQ-Z1 board, a platform that includes an edge FPGA.
We evaluate the two accelerator designs with four common DNN models, achieving an average performance speedup across models of up to 3.5x with a 2.9x reduction in energy consumption over CPU-only inference.
arXiv Detail & Related papers (2021-10-01T15:20:29Z)
- RNNAccel: A Fusion Recurrent Neural Network Accelerator for Edge Intelligence [2.055204980188575]
We present an RNN deep learning accelerator, called RNNAccel.
It supports Long Short-Term Memory (LSTM) networks, Gated Recurrent Unit (GRU) networks, and Fully Connected (FC)/Multilayer Perceptron (MLP) layers.
The 32-MAC RNN accelerator achieves 90% MAC utilization, 1.27 TOPS/W in a 40nm process, an 8x compression ratio, and 90% inference accuracy.
arXiv Detail & Related papers (2020-10-26T03:36:36Z)
- DANCE: Differentiable Accelerator/Network Co-Exploration [8.540518473228078]
This work presents a differentiable approach towards the co-exploration of the hardware accelerator and network architecture design.
By modeling the hardware evaluation software with a neural network, the relation between the accelerator architecture and the hardware metrics becomes differentiable.
Compared to existing naive approaches, our method performs co-exploration in a significantly shorter time while achieving superior accuracy and hardware-cost metrics.
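A rough sketch of that core idea is below, with an untrained surrogate network and invented architecture parameters standing in for the real hardware-evaluation model; DANCE fits the surrogate to an actual evaluation tool before using it.

```python
import torch
import torch.nn as nn

# Surrogate: maps accelerator parameters (e.g. PE-array size, buffer size,
# bandwidth) to a predicted hardware cost. Here it is random for brevity.
surrogate = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
for p in surrogate.parameters():
    p.requires_grad_(False)      # freeze: optimize the design, not the model

# Continuous relaxation of the accelerator design point.
design = torch.tensor([8.0, 64.0, 4.0], requires_grad=True)
opt = torch.optim.Adam([design], lr=0.1)

for _ in range(100):
    cost = surrogate(design)     # differentiable hardware-cost estimate
    opt.zero_grad()
    cost.backward()              # gradients flow back into the design itself
    opt.step()
```

Because the surrogate is differentiable, the accelerator design can be updated by gradient descent jointly with the network architecture, which is what removes the slow inner evaluation loop of naive co-exploration.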
arXiv Detail & Related papers (2020-09-14T07:43:27Z)
- SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation [97.78417228445883]
We present SmartExchange, an algorithm-hardware co-design framework for energy-efficient inference of deep neural networks (DNNs).
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all powers of two.
We further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance.
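As a rough illustration of that weight structure (the decomposition algorithm itself is the paper's contribution and is not reproduced here), the hand-built sketch below reconstructs a layer weight from a small dense basis and a sparse power-of-two coefficient matrix; all sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

basis = rng.standard_normal((64, 8))      # small dense basis matrix B

# Large sparse coefficient matrix C: ~90% zeros, nonzeros are +/- 2^k.
mask = rng.random((8, 256)) < 0.1
exponents = rng.integers(-3, 2, size=(8, 256))
signs = np.where(rng.random((8, 256)) < 0.5, -1.0, 1.0)
coeff = np.where(mask, signs * 2.0 ** exponents, 0.0)

weight = basis @ coeff                    # reconstructed 64 x 256 layer weight

# Why this saves storage and energy: B is tiny, and each nonzero of C needs
# only a sign plus a small exponent, so multiplying by it is a shift in
# hardware rather than a full multiply.
print(weight.size, "dense params vs ~", basis.size + np.count_nonzero(coeff),
      "stored values")
```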
arXiv Detail & Related papers (2020-05-07T12:12:49Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside coarse-grained structures, revealing a previously unknown point in the design space.
With the higher accuracy enabled by fine-grained pruning patterns, the key insight is to use the compiler to regain and guarantee high hardware efficiency.
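A toy sketch of pattern-based kernel pruning in that spirit appears below; the four masks are invented for illustration, whereas PatDNN selects its own pattern set and relies on its compiler to exploit the regularity.

```python
import numpy as np

# Fixed library of 3x3 masks, each keeping exactly 4 of 9 weights.
PATTERNS = np.array([
    [[0, 1, 0], [1, 1, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
], dtype=np.float32)

def prune_kernel(kernel):
    """Keep the pattern that preserves the most weight magnitude."""
    scores = [np.abs(kernel * p).sum() for p in PATTERNS]
    return kernel * PATTERNS[int(np.argmax(scores))]

conv = np.random.randn(16, 3, 3, 3).astype(np.float32)  # (out, in, kh, kw)
for o in range(conv.shape[0]):
    for i in range(conv.shape[1]):
        conv[o, i] = prune_kernel(conv[o, i])

# Every kernel now follows one of a handful of regular patterns, which is
# what lets a compiler regenerate dense, efficient code despite the
# fine-grained sparsity.
print("kept fraction:", np.count_nonzero(conv) / conv.size)  # ~4/9
```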
arXiv Detail & Related papers (2020-01-01T04:52:07Z)