SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN
Accelerators for Edge Inference
- URL: http://arxiv.org/abs/2110.00478v1
- Date: Fri, 1 Oct 2021 15:20:29 GMT
- Title: SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN
Accelerators for Edge Inference
- Authors: Jude Haris, Perry Gibson, José Cano, Nicolas Bohm Agostini, David
Kaeli
- Abstract summary: We propose SECDA, a new hardware/software co-design methodology to reduce the design time of optimized Deep Neural Network (DNN) inference accelerators on edge devices with FPGAs.
We use SECDA to efficiently develop two different DNN accelerator designs on a PYNQ-Z1 board, a platform that includes an edge FPGA.
We evaluate the two accelerator designs with four common DNN models, achieving an average performance speedup across models of up to 3.5$\times$ with a 2.9$\times$ reduction in energy consumption over CPU-only inference.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Edge computing devices inherently face tight resource constraints, which is
especially apparent when deploying Deep Neural Networks (DNNs) with high memory
and compute demands. FPGAs are commonly available in edge devices. Since these
reconfigurable circuits can achieve higher throughput and lower power
consumption than general-purpose processors, they are especially well-suited
for DNN acceleration. However, existing solutions for designing FPGA-based DNN
accelerators for edge devices come with high development overheads, given the
cost of repeated FPGA synthesis passes, reimplementation of the simulated
design in a Hardware Description Language (HDL), and accelerator system
integration.
In this paper we propose SECDA, a new hardware/software co-design methodology
to reduce the design time of optimized DNN inference accelerators on edge devices
with FPGAs. SECDA combines cost-effective SystemC simulation with hardware
execution, streamlining design space exploration and the development process
via reduced design evaluation time. As a case study, we use SECDA to
efficiently develop two different DNN accelerator designs on a PYNQ-Z1 board, a
platform that includes an edge FPGA. We quickly and iteratively explore the
system's hardware/software stack, while identifying and mitigating performance
bottlenecks. We evaluate the two accelerator designs with four common DNN
models, achieving an average performance speedup across models of up to
3.5$\times$ with a 2.9$\times$ reduction in energy consumption over CPU-only
inference. Our code is available at https://github.com/gicLAB/SECDA.
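To make the co-design loop concrete, below is a minimal SystemC sketch in the spirit of the methodology: a toy accelerator module that can be simulated cycle by cycle, so design changes are evaluated without an FPGA synthesis pass. The MacUnit module, its ports, and the stimulus are all invented for illustration; this is not SECDA's actual accelerator code.

```cpp
// Minimal simulate-first sketch: a toy 8-bit MAC unit standing in for a
// DNN accelerator core. Hypothetical example, not SECDA source code.
#include <systemc.h>
#include <iostream>

SC_MODULE(MacUnit) {
    sc_in<bool>        clk;
    sc_in<sc_int<8>>   a, b;     // quantized operand streams
    sc_out<sc_int<32>> acc;      // running accumulator output

    sc_int<32> sum = 0;

    void mac() {                 // fires on each rising clock edge
        sum += a.read() * b.read();
        acc.write(sum);
    }

    SC_CTOR(MacUnit) {
        SC_METHOD(mac);
        sensitive << clk.pos();
    }
};

int sc_main(int argc, char* argv[]) {
    sc_clock              clk("clk", 10, SC_NS);
    sc_signal<sc_int<8>>  a, b;
    sc_signal<sc_int<32>> acc;

    MacUnit mac("mac");
    mac.clk(clk);
    mac.a(a);
    mac.b(b);
    mac.acc(acc);

    // Drive a few operand pairs; the module accumulates a*b per clock edge.
    for (int i = 1; i <= 4; ++i) {
        a.write(i);
        b.write(i);
        sc_start(10, SC_NS);
    }
    std::cout << "accumulated: " << acc.read() << std::endl;
    return 0;
}
```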
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- FireFly v2: Advancing Hardware Support for High-Performance Spiking Neural Network with a Spatiotemporal FPGA Accelerator [8.0611988136866]
Spiking Neural Networks (SNNs) are expected to be a promising alternative to Artificial Neural Networks (ANNs).
Specialized SNN hardware offers clear advantages over general-purpose devices in terms of power and performance.
FireFly v2, an FPGA SNN accelerator, can address the issue of non-spike operation in current SOTA SNN algorithms.
arXiv Detail & Related papers (2023-09-28T04:17:02Z)
- End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs [49.358119307844035]
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs).
This makes efficient NN implementations in hardware accessible to nonexperts, in a single open-sourced workflow.
We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the Large Hadron Collider (LHC).
We implement an optimized mixed-precision NN for high-momentum particle jets in simulated LHC proton-proton collisions.
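As context for what "quantized" means here, the sketch below shows a generic symmetric uniform quantizer applied with different per-layer bit widths. This is a textbook scheme for illustration only: the paper's workflow chooses bit widths using Hessian-aware (second-order) information, which is not modeled here, and all names and example bit widths are hypothetical.

```cpp
// Generic symmetric uniform quantization sketch (not the paper's
// Hessian-aware method): map floats onto a signed integer grid whose
// scale is derived from the tensor's maximum magnitude.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

struct Quantized {
    std::vector<int32_t> q;  // integer codes
    float scale;             // dequantize via q[i] * scale
};

Quantized quantize(const std::vector<float>& w, int bits) {
    float max_abs = 0.0f;
    for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
    int32_t qmax = (1 << (bits - 1)) - 1;          // e.g. 127 for 8 bits
    float scale = max_abs > 0.0f ? max_abs / qmax : 1.0f;
    Quantized out{{}, scale};
    for (float x : w)
        out.q.push_back(std::lround(
            std::clamp(x / scale, float(-qmax), float(qmax))));
    return out;
}

int main() {
    std::vector<float> layer = {0.8f, -0.3f, 0.05f, -1.2f};
    for (int bits : {8, 4}) {  // hypothetical per-layer bit widths
        Quantized qw = quantize(layer, bits);
        std::cout << bits << "-bit scale = " << qw.scale << "\n";
    }
}
```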
arXiv Detail & Related papers (2023-04-13T18:00:01Z)
- HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices [71.45672882756001]
This study introduces a novel streaming architecture based toolflow for mapping 3D Convolutional Neural Networks onto FPGAs.
The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics.
The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs.
arXiv Detail & Related papers (2023-03-30T08:25:27Z)
- Optimization of FPGA-based CNN Accelerators Using Metaheuristics [1.854931308524932]
Convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields.
FPGAs have seen a surge in interest for accelerating CNN inference.
The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs).
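As a rough illustration of a metaheuristic in this setting, the sketch below uses simulated annealing to split layers between two CLPs under a toy latency model. The cost model and per-layer work values are invented for illustration; real toolflows optimize detailed FPGA resource and dataflow models.

```cpp
// Toy model: each layer is assigned to one of two CLPs that run in
// parallel; total latency is the slower CLP's summed work. All numbers
// are made up; this is not the paper's formulation.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

double latency(const std::vector<int>& assign, const std::vector<double>& work) {
    double t[2] = {0.0, 0.0};
    for (size_t i = 0; i < assign.size(); ++i) t[assign[i]] += work[i];
    return std::max(t[0], t[1]);
}

int main() {
    std::vector<double> work = {4.0, 7.5, 3.2, 9.1, 2.6, 5.8};  // per-layer cost
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, (int)work.size() - 1);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    std::vector<int> assign(work.size(), 0);  // start: everything on CLP 0
    double cur = latency(assign, work);

    // Simulated annealing: flip one layer's CLP assignment, accept worse
    // moves with probability exp(-delta/T), cool T geometrically.
    for (double T = 5.0; T > 1e-3; T *= 0.99) {
        std::vector<int> cand = assign;
        cand[pick(rng)] ^= 1;
        double c = latency(cand, work);
        if (c < cur || unif(rng) < std::exp((cur - c) / T)) {
            assign = cand;
            cur = c;
        }
    }
    std::cout << "balanced two-CLP latency: " << cur << "\n";
}
```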
arXiv Detail & Related papers (2022-09-22T18:57:49Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
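For intuition about butterfly sparsity (though not the paper's learned, hardware-mapped variant), the sketch below applies log2(n) sparse butterfly stages to a vector, each stage mixing index pairs with a 2x2 block, giving O(n log n) work instead of O(n^2). Sharing one 2x2 block per stage is a simplification invented here; real butterfly factors carry a distinct block per pair, with learned weights.

```cpp
// Butterfly matrix-vector product sketch: log2(n) stages, each mixing
// pairs (i, i + stride) with a 2x2 block. Placeholder weights.
#include <iostream>
#include <vector>

void butterfly_apply(std::vector<float>& x,
                     const std::vector<std::vector<float>>& w) {
    size_t n = x.size();                        // must be a power of two
    size_t stage = 0;
    for (size_t stride = n / 2; stride >= 1; stride /= 2, ++stage) {
        for (size_t i = 0; i < n; ++i) {
            if (i & stride) continue;           // visit each pair once
            size_t j = i + stride;
            const std::vector<float>& b = w[stage];  // {w00, w01, w10, w11}
            float xi = x[i], xj = x[j];
            x[i] = b[0] * xi + b[1] * xj;
            x[j] = b[2] * xi + b[3] * xj;
        }
    }
}

int main() {
    std::vector<float> x = {1, 2, 3, 4};
    // Two stages for n = 4; one placeholder 2x2 block per stage.
    std::vector<std::vector<float>> w(2, {0.5f, 0.5f, 0.5f, -0.5f});
    butterfly_apply(x, w);
    for (float v : x) std::cout << v << ' ';
    std::cout << "\n";
}
```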
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- HALF: Holistic Auto Machine Learning for FPGAs [1.9146960682777232]
Deep Neural Networks (DNNs) are capable of solving complex problems in domains related to embedded systems, such as image and natural language processing.
To efficiently implement DNNs on a specific FPGA platform for a given cost criterion, e.g. energy efficiency, an enormous number of design parameters has to be considered.
An automatic, holistic design approach can significantly improve the quality of DNN implementations on FPGAs.
arXiv Detail & Related papers (2021-06-28T14:45:47Z)
- DNN-Chip Predictor: An Analytical Performance Predictor for DNN Accelerators with Various Dataflows and Hardware Architectures [30.689015188050405]
The recent breakthroughs in deep neural networks (DNNs) have spurred a tremendously increased demand for DNN accelerators.
DNN-Chip Predictor is an analytical performance predictor which can accurately predict DNN accelerators' energy, throughput, and latency prior to their actual implementation.
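The flavor of such an analytical model can be shown with a much simpler roofline-style estimate: a layer's latency is the maximum of its compute time and its memory-transfer time, whichever resource is the bottleneck. This sketch is not DNN-Chip Predictor's actual model; all hardware and layer numbers are invented.

```cpp
// Roofline-style latency estimate: latency = max(compute time, memory time).
#include <algorithm>
#include <iostream>

struct Hw {
    double peak_macs_per_s;   // e.g. 1e11 MAC/s  (invented)
    double dram_bytes_per_s;  // e.g. 1e10 B/s    (invented)
};

double layer_latency_s(double macs, double bytes_moved, const Hw& hw) {
    double t_compute = macs / hw.peak_macs_per_s;
    double t_memory  = bytes_moved / hw.dram_bytes_per_s;
    return std::max(t_compute, t_memory);  // the bottleneck dominates
}

int main() {
    Hw hw{1e11, 1e10};
    // Hypothetical conv layer: 100 MMACs, 8 MB of weight/activation traffic.
    double t = layer_latency_s(1e8, 8e6, hw);
    std::cout << "estimated latency: " << t * 1e3 << " ms\n";  // 1 ms
}
```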
arXiv Detail & Related papers (2020-02-26T02:59:18Z)
- AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs [36.490296335959485]
AutoDNNchip is a chip generator that can automatically generate both FPGA- and ASIC-based DNN chip implementation for a designated application and dataset.
Our Chip Predictor's predicted performance differs from real-measured results by less than 10% when validated.
Accelerators generated by AutoDNNchip can achieve better performance (up to a 3.86X improvement) than expert-crafted state-of-the-art accelerators.
arXiv Detail & Related papers (2020-01-06T05:32:15Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
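To illustrate the idea of a fine-grained pattern inside a coarse-grained structure, the sketch below zeroes a 3x3 convolution kernel according to a fixed 4-entry mask. The specific mask is hypothetical; in pattern-based pruning each kernel's pattern is chosen from a small library, so the compiler can emit regular, efficient code per pattern.

```cpp
// Pattern-based pruning sketch: keep only the weights selected by a
// fixed per-kernel mask (4 of 9 here). The mask is a made-up example.
#include <array>
#include <iostream>

using Kernel3x3 = std::array<std::array<float, 3>, 3>;
using Mask3x3   = std::array<std::array<int, 3>, 3>;

Kernel3x3 apply_pattern(const Kernel3x3& k, const Mask3x3& mask) {
    Kernel3x3 out{};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            out[r][c] = mask[r][c] ? k[r][c] : 0.0f;  // zero unmasked weights
    return out;
}

int main() {
    Kernel3x3 k = {{{0.3f, -0.1f, 0.7f},
                    {0.2f,  0.9f, -0.4f},
                    {-0.6f, 0.1f,  0.5f}}};
    // Hypothetical 4-entry pattern: keep the top-center and middle row.
    Mask3x3 pattern = {{{0, 1, 0},
                        {1, 1, 1},
                        {0, 0, 0}}};
    Kernel3x3 pruned = apply_pattern(k, pattern);
    for (const auto& row : pruned) {
        for (float v : row) std::cout << v << ' ';
        std::cout << "\n";
    }
}
```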
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.