Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology
- URL: http://arxiv.org/abs/2410.16298v2
- Date: Wed, 30 Oct 2024 10:55:11 GMT
- Title: Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology
- Authors: Anagha Nimbekar, Prabodh Katti, Chen Li, Bashir M. Al-Hashimi, Amit Acharyya, Bipin Rajendran
- Abstract summary: Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNNs) to reduced-precision spiking models.
- Score: 2.968768532937366
- License:
- Abstract: Spiking Neural Networks (SNNs) have emerged as a promising approach to improving the energy efficiency of machine learning models, as they naturally implement event-driven computation while avoiding expensive multiplication operations. In this paper, we develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNNs) to reduced-precision spiking models, demonstrating fast and accurate inference in a novel event-driven CMOS reconfigurable spiking inference accelerator. Experimental results show that reduced-precision ResNet-18 and VGG-11 SNN models achieve classification accuracy within 1% of the baseline full-precision DNN models within 8 spike timesteps. We also demonstrate an FPGA prototype implementation of the spiking inference accelerator with a throughput of 38.4 giga operations per second (GOPS) while consuming 1.54 W on a PYNQ-Z2 FPGA. This corresponds to 0.6 GOPS per processing element and 2.25 GOPS per DSP slice, which is 2x and 4.5x higher utilisation efficiency, respectively, than the state-of-the-art. Our co-optimisation strategy can be employed to develop deep reduced-precision SNN models and port them to resource-efficient event-driven hardware accelerators for edge applications.
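The energy argument in the abstract rests on a simple property: when activations are binary spike events, every synaptic update reduces to adding a weight, so no multiplier is ever exercised. The sketch below illustrates this with a toy rate-coded integrate-and-fire layer run for the paper's 8 timesteps; the function name, the soft-reset rule, and the threshold value are our assumptions for illustration, not details of the paper's accelerator.

```python
import numpy as np

def if_layer(spikes_in, W, v_th=1.0, T=8):
    """Integrate-and-fire layer over T timesteps.

    Because inputs are binary spikes, the weighted sum reduces to
    selective accumulation of weight columns: no multiplications,
    which is the property event-driven SNN hardware exploits.
    """
    n_out = W.shape[0]
    v = np.zeros(n_out)                        # membrane potentials
    spikes_out = np.zeros((T, n_out))
    for t in range(T):
        active = np.flatnonzero(spikes_in[t])  # event-driven: only firing inputs
        for j in active:
            v += W[:, j]                       # accumulate, never multiply
        fired = v >= v_th
        spikes_out[t][fired] = 1.0
        v[fired] -= v_th                       # soft reset keeps residual charge
    return spikes_out
```

The reported figures are also mutually consistent: 38.4 GOPS at 0.6 GOPS per processing element implies a 64-PE array, 38.4 GOPS at 2.25 GOPS per DSP slice implies roughly 17 DSP slices, and 38.4 GOPS at 1.54 W works out to about 24.9 GOPS/W.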
Related papers
- EPIM: Efficient Processing-In-Memory Accelerators based on Epitome [78.79382890789607]
We introduce the Epitome, a lightweight neural operator offering convolution-like functionality.
On the software side, we evaluate epitomes' latency and energy on PIM accelerators.
We introduce a PIM-aware layer-wise design method to enhance their hardware efficiency.
arXiv Detail & Related papers (2023-11-12T17:56:39Z)
- PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration [14.091100907457037]
Product quantization (PQ) has been applied to deep neural networks (DNNs) to replace multiply-accumulate (MAC) operations with table lookups.
We create a custom hardware accelerator to parallelize and accelerate nearest-neighbor search and dot-product lookups.
We identify PQ configurations that improve performance-per-area for ResNet20 by up to 3.1x, even when compared to a highly optimized conventional DNN accelerator (see the sketch after this entry).
arXiv Detail & Related papers (2023-05-25T09:27:01Z)
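The PQ idea behind PQA can be sketched as follows: dot products against many stored weight vectors are approximated by summing entries of small per-subspace lookup tables, replacing MACs with table reads and additions. This is a generic PQ illustration under assumed sizes (random codebooks stand in for ones learned offline), not the PQA accelerator itself.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 64, 8, 16           # input dim, subspaces, centroids per subspace
d = D // M                    # subvector length

codebooks = rng.standard_normal((M, K, d))   # learned offline in real PQ
W_codes = rng.integers(0, K, size=(128, M))  # each weight row stored as M codes

def pq_dot(x):
    """Approximate x @ W.T using lookups instead of MACs."""
    x_sub = x.reshape(M, d)
    # One small dot-product table per subspace, K entries each.
    lut = np.einsum('mkd,md->mk', codebooks, x_sub)   # (M, K)
    # Per output row: sum M table lookups; additions only from here on.
    return lut[np.arange(M), W_codes].sum(axis=1)

y = pq_dot(rng.standard_normal(D))   # 128 approximate dot products
```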
- End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs [49.358119307844035]
We develop an end-to-end workflow for the training and implementation of co-designed neural networks (NNs).
This makes efficient NN implementations in hardware accessible to non-experts through a single open-source workflow.
We demonstrate the workflow in a particle physics application involving trigger decisions that must operate at the 40 MHz collision rate of the Large Hadron Collider (LHC).
We implement an optimized mixed-precision NN for high-momentum particle jets in simulated LHC proton-proton collisions.
arXiv Detail & Related papers (2023-04-13T18:00:01Z)
- FireFly: A High-Throughput Hardware Accelerator for Spiking Neural Networks with Efficient DSP and Memory Optimization [6.966706170499345]
Spiking neural networks (SNNs) have been widely used due to their strong biological interpretability and high energy efficiency.
Most SNN hardware implementations for field-programmable gate arrays (FPGAs) cannot meet arithmetic or memory efficiency requirements.
We propose FireFly, an FPGA accelerator that processes spikes generated by firing neurons on the fly.
arXiv Detail & Related papers (2023-01-05T04:28:07Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch; a minimal usage sketch follows this entry.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
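snnTorch is a public PyTorch-based package, so a minimal usage sketch is possible; the example below follows its documented leaky integrate-and-fire API, but the network shape, beta value, and timestep count are arbitrary choices, and nothing here is IPU-specific.

```python
import torch
import torch.nn as nn
import snntorch as snn

lif = snn.Leaky(beta=0.9)        # leaky integrate-and-fire neuron, decay beta
fc = nn.Linear(784, 10)

x = torch.rand(8, 784)           # 8 timesteps of input
mem = lif.init_leaky()           # initialize membrane potential
spikes = []
for t in range(8):
    cur = fc(x[t].unsqueeze(0))  # synaptic current at timestep t
    spk, mem = lif(cur, mem)     # spike output and updated membrane state
    spikes.append(spk)
out = torch.stack(spikes).sum(0) # rate-coded prediction: spike counts per class
```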
- LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics [45.666822327616046]
This work presents a novel reconfigurable architecture for Low Latency Graph Neural Network (LL-GNN) designs for particle detectors.
The LL-GNN design advances the next generation of trigger systems by enabling sophisticated algorithms to process experimental data efficiently.
arXiv Detail & Related papers (2022-09-28T12:55:35Z)
- Design of High-Throughput Mixed-Precision CNN Accelerators on FPGA [0.0]
Layer-wise mixed-precision quantization enables more efficient implementations but inflates the design space.
We present an in-depth quantitative methodology to efficiently explore the design space given the limited hardware resources of a target FPGA.
Our resulting hardware accelerators implement truly mixed-precision operations that enable efficient execution of layer-wise and channel-wise quantized CNNs (see the sketch after this entry).
arXiv Detail & Related papers (2022-08-09T15:32:51Z)
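A hedged sketch of what layer-wise mixed precision means in practice: each layer's weights are quantized with its own bit-width, here using plain symmetric uniform quantization. The bit assignments and the quantizer are illustrative assumptions, not the paper's methodology.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale     # integer weights + dequant scale

# Layer-wise mixed precision: each layer gets its own bit-width.
layer_bits = {"conv1": 8, "conv2": 4, "fc": 2}   # hypothetical assignment
weights = {name: np.random.randn(64, 64) for name in layer_bits}
quantized = {name: quantize(w, layer_bits[name]) for name, w in weights.items()}
```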
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
It reduces classification time by three orders of magnitude, with a small 4.5% accuracy penalty compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration [71.80326738527734]
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations.
We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform state-of-the-art DNN optimization frameworks (see the sketch after this entry).
arXiv Detail & Related papers (2021-11-22T23:53:14Z)
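Fine-grained structured pruning can be pictured as zeroing small, regularly shaped blocks of a weight matrix by magnitude, so the surviving pattern remains compiler- and hardware-friendly. The block shape, scoring rule, and sparsity target below are generic assumptions, not the paper's specific scheme or its compiler optimizations.

```python
import numpy as np

def block_prune(W, block=(4, 1), sparsity=0.5):
    """Zero the lowest-magnitude blocks of W; the block shape keeps structure."""
    bh, bw = block
    H, L = W.shape
    assert H % bh == 0 and L % bw == 0
    # Score each block by its L1 norm.
    blocks = W.reshape(H // bh, bh, L // bw, bw)
    scores = np.abs(blocks).sum(axis=(1, 3))          # (H//bh, L//bw)
    k = int(scores.size * sparsity)
    cutoff = np.partition(scores.ravel(), k)[k]       # magnitude threshold
    mask = (scores >= cutoff)[:, None, :, None]       # broadcast back to blocks
    return (blocks * mask).reshape(H, L)

W = np.random.randn(16, 16)
Wp = block_prune(W, block=(4, 1), sparsity=0.5)       # ~50% of blocks zeroed
```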
- NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as serious contenders to replace graphics processing unit/central processing unit-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource- and energy-efficient, ultra-low-latency FPGA-based neural network accelerators (see the sketch after this entry).
arXiv Detail & Related papers (2021-04-07T00:16:39Z)
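The fixed-function idea can be conveyed with a toy reconstruction (not NullaNet Tiny's actual synthesis flow): a small-fan-in binarized neuron is exhaustively enumerated into a truth table and emitted as a sum-of-products boolean expression, so inference needs no arithmetic at all.

```python
from itertools import product

# Hypothetical trained binarized neuron: +/-1 weights, integer threshold.
weights, threshold = [1, -1, 1, 1], 1

def neuron(bits):
    """Reference arithmetic neuron: fires if the weighted sum clears threshold."""
    s = sum(w if b else -w for w, b in zip(weights, bits))
    return int(s >= threshold)

# Enumerate the full truth table (feasible for small fan-in).
minterms = [bits for bits in product([0, 1], repeat=4) if neuron(bits)]

# Emit a sum-of-products expression: one AND term per firing input pattern.
terms = [" & ".join(f"x{i}" if b else f"~x{i}" for i, b in enumerate(bits))
         for bits in minterms]
print(" | ".join(f"({t})" for t in terms))   # fixed-function logic, no arithmetic
```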
This list is automatically generated from the titles and abstracts of the papers in this site.