Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect
- URL: http://arxiv.org/abs/2006.12274v1
- Date: Thu, 18 Jun 2020 17:13:18 GMT
- Authors: Andreas Bytyn, René Ahlsdorf, Rainer Leupers, Gerd Ascheid
- Abstract summary: Machine intelligence, especially using convolutional neural networks (CNNs), has become a major area of research in recent years.
Many-core platforms consisting of several homogeneous cores can alleviate limitations with regard to physical implementation at the expense of an increased dataflow mapping effort.
This work presents an automated mapping strategy starting at the single-core level with different optimization targets for minimal runtime and minimal off-chip memory accesses.
The strategy is then extended towards a suitable many-core mapping scheme and evaluated using a scalable system-level simulation with a network-on-chip interconnect.
- Score: 0.0764671395172401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine intelligence, especially using convolutional neural networks (CNNs),
has become a major area of research in recent years. Increasingly
sophisticated hardware accelerators have been proposed that exploit, e.g., sparsity
in computations and use reduced-precision arithmetic to scale down
energy consumption. However, future platforms require more than just energy
efficiency: Scalability is becoming an increasingly important factor. The
required effort for physical implementation grows with the size of the
accelerator, making it more difficult to meet target constraints. Using
many-core platforms consisting of several homogeneous cores can alleviate the
aforementioned limitations with regard to physical implementation at the
expense of an increased dataflow mapping effort. While the dataflow in CNNs is
deterministic and can therefore be optimized offline, the problem of finding a
suitable scheme that minimizes both runtime and off-chip memory accesses is a
challenging task which becomes even more complex if an interconnect system is
involved. This work presents an automated mapping strategy starting at the
single-core level with different optimization targets for minimal runtime and
minimal off-chip memory accesses. The strategy is then extended towards a
suitable many-core mapping scheme and evaluated using a scalable system-level
simulation with a network-on-chip interconnect. Design space exploration is
performed by mapping the well-known CNNs AlexNet and VGG-16 to platforms of
different core counts and computational power per core in order to investigate
the trade-offs. Our mapping strategy and system setup are scaled from the
single-core level up to 128 cores, thereby showing the limits of the
selected approach.
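As a loose illustration of the single-core part of such a mapping strategy, the sketch below enumerates tile sizes for one convolutional layer and scores each buffer-feasible tiling for both optimization targets (runtime and off-chip accesses). The cost model, buffer size, reuse pattern, and overhead constants are invented for illustration, not taken from the paper.

```python
# Hypothetical sketch: score tilings of one CNN layer on a single core,
# trading estimated runtime against off-chip memory accesses.
# All parameters and the cost model are illustrative assumptions.
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def score_tilings(out_ch, in_ch, fmap, buffer_words, macs_per_cycle,
                  tile_overhead=100):
    """Enumerate tile sizes that fit the on-chip buffer and estimate
    runtime (cycles) and off-chip word accesses for each mapping."""
    results = []
    for t_oc, t_f in product(divisors(out_ch), divisors(fmap)):
        # On-chip footprint: weight tile + input tile + partial-sum tile.
        footprint = t_oc * in_ch + in_ch * t_f + t_oc * t_f
        if footprint > buffer_words:
            continue
        n_tiles = (out_ch // t_oc) * (fmap // t_f)
        # Assumed reuse: all weights are re-fetched once per feature-map tile.
        offchip = out_ch * in_ch * (fmap // t_f) + in_ch * fmap + out_ch * fmap
        runtime = out_ch * in_ch * fmap / macs_per_cycle + n_tiles * tile_overhead
        results.append({"t_oc": t_oc, "t_f": t_f,
                        "runtime": runtime, "offchip": offchip})
    return results

tilings = score_tilings(out_ch=64, in_ch=32, fmap=256,
                        buffer_words=16384, macs_per_cycle=16)
best_traffic = min(tilings, key=lambda r: r["offchip"])  # min off-chip target
best_runtime = min(tilings, key=lambda r: r["runtime"])  # min runtime target
```

Extending such a search to the many-core case would additionally have to charge interconnect transfers in the cost model, which is where the NoC simulation in the paper comes in.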
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Memory-aware Scheduling for Complex Wired Networks with Iterative Graph Optimization [4.614780125575351]
We propose an efficient memory-aware scheduling framework based on iterative graph optimization.
Our framework features an iterative graph fusion algorithm that simplifies the graph while preserving the scheduling optimality.
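As a toy illustration of memory-aware scheduling (not the paper's iterative graph-fusion algorithm), the sketch below brute-forces the topological orders of a tiny tensor DAG and keeps the one with the lowest peak buffer memory; the graph and buffer sizes are invented for the example.

```python
# Hypothetical sketch: pick the topological order of a small tensor DAG
# that minimizes peak live-buffer memory. Graph and sizes are illustrative.
from itertools import permutations

# op -> (output buffer size, input ops)
graph = {
    "a": (4, []),
    "b": (2, ["a"]),
    "c": (2, ["a"]),
    "d": (1, ["b", "c"]),
}

def is_topological(order):
    seen = set()
    for op in order:
        if any(dep not in seen for dep in graph[op][1]):
            return False
        seen.add(op)
    return True

def peak_memory(order):
    """Peak sum of live output buffers; a buffer is freed after its last use."""
    last_use = {}
    for i, op in enumerate(order):
        for dep in graph[op][1]:
            last_use[dep] = i
    live = peak = 0
    for i, op in enumerate(order):
        live += graph[op][0]
        peak = max(peak, live)
        for dep in graph[op][1]:
            if last_use[dep] == i:
                live -= graph[dep][0]
    return peak

best = min((o for o in permutations(graph) if is_topological(o)),
           key=peak_memory)
```

Brute force is only viable for toy graphs; frameworks like the one above-cited instead fuse and simplify the graph so that a near-optimal schedule can be found at realistic scale.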
arXiv Detail & Related papers (2023-08-26T14:52:02Z)
- Slimmable Encoders for Flexible Split DNNs in Bandwidth and Resource Constrained IoT Systems [12.427821850039448]
We propose a novel split computing approach based on slimmable ensemble encoders.
The key advantage of our design is the ability to adapt computational load and transmitted data size in real-time with minimal overhead and time.
Our model outperforms existing solutions in terms of compression efficacy and execution time, especially in the context of weak mobile devices.
arXiv Detail & Related papers (2023-06-22T06:33:12Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around 40% of the available hardware resources.
It reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- A Design Flow for Mapping Spiking Neural Networks to Many-Core Neuromorphic Hardware [4.527975416669432]
Many-core neuromorphic hardware is expected to execute large machine learning models.
To deal with the design complexity, a predictable design flow is needed to guarantee real-time performance.
We propose an SDFG-based design flow for mapping spiking neural networks to many-core neuromorphic hardware.
arXiv Detail & Related papers (2021-08-27T18:08:08Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
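The core decomposition idea can be sketched as follows: a uniformly quantized 2-bit weight matrix with levels {-3, -1, +1, +3} splits exactly into two {-1, +1} matrices, so one quantized matmul becomes two binary-matrix branches. This is a minimal sketch of the encoding only, not the paper's full method; names and shapes are illustrative.

```python
# Hypothetical sketch: W with entries in {-3, -1, +1, +3} decomposes as
# W = B1 + 2*B2 with B1, B2 in {-1, +1}, so X @ W splits into two
# binary-matrix branches (each amenable to XNOR/popcount-style kernels).
import numpy as np

def decompose_2bit(W):
    """Split W (entries in {-3, -1, 1, 3}) into two {-1, +1} matrices."""
    B2 = np.where(W > 0, 1, -1)   # sign carries the high-order bit
    B1 = W - 2 * B2               # remainder also lands in {-1, +1}
    return B1, B2

rng = np.random.default_rng(0)
W = rng.choice([-3, -1, 1, 3], size=(4, 4))
B1, B2 = decompose_2bit(W)
X = rng.standard_normal((2, 4))
# The quantized product equals the sum of the two binary branches.
assert np.allclose(X @ W, X @ B1 + 2 * (X @ B2))
```

Deeper quantization levels extend the same idea with more branches and larger power-of-two scales.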
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- ItNet: iterative neural networks with small graphs for accurate and efficient anytime prediction [1.52292571922932]
In this study, we introduce a class of network models that have a small memory footprint in terms of their computational graphs.
We show state-of-the-art results for semantic segmentation on the CamVid and Cityscapes datasets.
arXiv Detail & Related papers (2021-01-21T15:56:29Z)
- Multi-scale Interaction for Real-time LiDAR Data Segmentation on an Embedded Platform [62.91011959772665]
Real-time semantic segmentation of LiDAR data is crucial for autonomously driving vehicles.
Current approaches that operate directly on the point cloud use complex spatial aggregation operations.
We propose a projection-based method, called Multi-scale Interaction Network (MINet), which is very efficient and accurate.
arXiv Detail & Related papers (2020-08-20T19:06:11Z)
- Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.