A Design Flow for Mapping Spiking Neural Networks to Many-Core
Neuromorphic Hardware
- URL: http://arxiv.org/abs/2108.12444v1
- Date: Fri, 27 Aug 2021 18:08:08 GMT
- Authors: Shihao Song, M. Lakshmi Varshika, Anup Das, and Nagarajan Kandasamy
- Abstract summary: Many-core neuromorphic hardware is expected to execute large machine learning models.
To deal with the design complexity, a predictable design flow is needed to guarantee real-time performance.
We propose an SDFG-based design flow for mapping spiking neural networks to many-core neuromorphic hardware.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The design of many-core neuromorphic hardware is becoming increasingly
complex as these systems are expected to execute large machine learning models.
To deal with the design complexity, a predictable design flow is needed to
guarantee real-time performance metrics such as latency and throughput without
significantly increasing the buffer requirement of computing cores. Synchronous
Data Flow Graphs (SDFGs) are used for predictable mapping of streaming
applications to multiprocessor systems. We propose an SDFG-based design flow
for mapping spiking neural networks (SNNs) to many-core neuromorphic hardware
with the objective of exploring the tradeoff between throughput and buffer
size. The proposed design flow integrates an iterative partitioning approach
based on the Kernighan-Lin graph partitioning heuristic, creating SNN clusters
such that each cluster can be mapped to a core of the hardware. The partitioning
approach minimizes the inter-cluster spike communication, which improves
latency on the shared interconnect of the hardware. Next, the design flow maps
clusters to cores using an instance of Particle Swarm Optimization (PSO),
an evolutionary algorithm, exploring the design space of throughput and buffer
size. Pareto-optimal mappings are retained from the design flow, allowing
system designers to select a mapping that satisfies the throughput and
buffer size requirements of the design. We evaluated the design flow using five
large-scale convolutional neural network (CNN) models. Results demonstrate 63%
higher maximum throughput and 10% lower buffer size requirement compared to
state-of-the-art dataflow-based mapping solutions.
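
As a concrete illustration of the partitioning step, the following minimal sketch recursively bisects a weighted SNN connectivity graph (edge weights standing for spike counts between neurons) with the Kernighan-Lin heuristic until each cluster fits on a core. This is an assumption-laden sketch, not the authors' implementation: the NetworkX routine is a stand-in for the paper's iterative partitioner, and NEURONS_PER_CORE and all helper names are illustrative.

```python
# Sketch: iterative Kernighan-Lin partitioning of an SNN into per-core
# clusters, minimizing inter-cluster spike traffic. Hypothetical helpers;
# NEURONS_PER_CORE is an assumed per-core neuron capacity.
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

NEURONS_PER_CORE = 128  # assumed hardware capacity, not from the paper

def partition_snn(graph: nx.Graph) -> list[set]:
    """Recursively bisect the SNN graph until each cluster fits a core."""
    if graph.number_of_nodes() <= NEURONS_PER_CORE:
        return [set(graph.nodes)]
    # KL bisection minimizes the weight of the cut, i.e. the number of
    # spikes that must cross the shared interconnect between clusters.
    left, right = kernighan_lin_bisection(graph, weight="weight", seed=0)
    return (partition_snn(graph.subgraph(left).copy())
            + partition_snn(graph.subgraph(right).copy()))

def inter_cluster_spikes(graph: nx.Graph, clusters: list[set]) -> int:
    """Total spike traffic crossing cluster boundaries."""
    core_of = {n: i for i, c in enumerate(clusters) for n in c}
    return sum(w for u, v, w in graph.edges(data="weight", default=1)
               if core_of[u] != core_of[v])
```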
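
The cluster-to-core mapping search can likewise be pictured with a short PSO sketch. Everything here is illustrative: a particle is a real-valued vector that decodes (by truncation) to one core index per cluster, and evaluate_mapping() is a toy placeholder for the paper's SDFG-based throughput/buffer analysis; the constants and the scalarized fitness are assumptions, not the paper's parameters.

```python
# Sketch: PSO search over cluster-to-core mappings.
import numpy as np

N_CLUSTERS, N_CORES = 24, 8
N_PARTICLES, N_ITERS = 30, 100
rng = np.random.default_rng(0)

def evaluate_mapping(mapping: np.ndarray) -> tuple[float, float]:
    """Toy (throughput, buffer) model; the paper uses SDFG analysis here."""
    load = np.bincount(mapping, minlength=N_CORES)
    throughput = 1.0 / (1.0 + load.max())   # bottleneck-core toy model
    buffer_size = float(load.var())         # load imbalance as a proxy
    return throughput, buffer_size

def fitness(mapping: np.ndarray, alpha: float) -> float:
    t, b = evaluate_mapping(mapping)
    return alpha * t - (1.0 - alpha) * b    # scalarized two-objective score

def pso(alpha: float) -> np.ndarray:
    """Return the best cluster-to-core mapping found for one weight alpha."""
    pos = rng.uniform(0, N_CORES, (N_PARTICLES, N_CLUSTERS))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p.astype(int), alpha) for p in pos])
    for _ in range(N_ITERS):
        gbest = pbest[pbest_fit.argmax()]   # best mapping seen so far
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0, N_CORES - 1e-9)
        fit = np.array([fitness(p.astype(int), alpha) for p in pos])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
    return pbest[pbest_fit.argmax()].astype(int)
```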
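
Finally, retaining the Pareto-optimal mappings amounts to a non-dominated filter over the (throughput, buffer size) pairs produced by the search. Continuing the sketch above, with the same caveats (toy objective, hypothetical names):

```python
# Sketch: keep only Pareto-optimal (throughput, buffer) points, i.e. those
# not dominated by any mapping with higher throughput AND lower buffer.
def pareto_front(points):
    """points: list of (throughput, buffer_size) pairs."""
    return [(t, b) for t, b in points
            if not any(t2 >= t and b2 <= b and (t2, b2) != (t, b)
                       for t2, b2 in points)]

# Usage: sweep the scalarization weight to collect candidate mappings,
# then filter; a designer picks the point that meets the design's
# throughput and buffer size requirements.
candidates = [evaluate_mapping(pso(alpha))
              for alpha in (0.1, 0.3, 0.5, 0.7, 0.9)]
print(pareto_front(candidates))
```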
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment
Task-oriented edge computing addresses the data-processing demands of IoVT systems by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- An FPGA-Based Accelerator Enabling Efficient Support for CNNs with Arbitrary Kernel Sizes
Convolutional neural networks (CNNs) with large kernels have demonstrated impressive performance in various vision-based applications.
An FPGA-based inference accelerator is proposed for the efficient deployment of CNNs with arbitrary kernel sizes.
The proposed hardware accelerator, evaluated on Intel Arria 10 FPGA, achieves up to 3.91 times better DSP efficiency than prior art on the same network.
arXiv Detail & Related papers (2024-02-22T05:52:55Z)
- Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition
We propose a framework for mapping CNNs onto FPGAs based on a novel tensor decomposition method called Mixed-TD.
The proposed method applies layer-specific Singular Value Decomposition (SVD) and Canonical Polyadic Decomposition (CPD) in a mixed manner, achieving 1.73x to 10.29x throughput per DSP compared to state-of-the-art CNNs.
arXiv Detail & Related papers (2023-06-08T08:16:38Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- FlowNAS: Neural Architecture Search for Optical Flow Estimation
We propose a neural architecture search method named FlowNAS to automatically find a better encoder architecture for the flow estimation task.
Experimental results show that the discovered architecture with the weights inherited from the super-network achieves 4.67% F1-all error on KITTI.
arXiv Detail & Related papers (2022-07-04T09:05:25Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) algorithm, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC services.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- N3H-Core: Neuron-designed Neural Network Accelerator via FPGA-based Heterogeneous Computing Cores
We develop an FPGA-based heterogeneous computing system for neural network acceleration.
The proposed accelerator consists of DSP- and LUT-based GEneral Matrix-Multiplication (GEMM) computing cores.
Our design outperforms the state-of-the-art Mix&Match design, with latency reduced by 1.12x-1.32x and higher inference accuracy.
arXiv Detail & Related papers (2021-12-15T15:12:00Z)
- Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect
Machine intelligence, especially using convolutional neural networks (CNNs), has become a large area of research in recent years.
Many-core platforms consisting of several homogeneous cores can alleviate limitations with regard to physical implementation at the expense of an increased dataflow mapping effort.
This work presents an automated mapping strategy starting at the single-core level with different optimization targets for minimal runtime and minimal off-chip memory accesses.
The strategy is then extended towards a suitable many-core mapping scheme and evaluated using a scalable system-level simulation with a network-on-chip interconnect.
arXiv Detail & Related papers (2020-06-18T17:13:18Z)