Data Streaming and Traffic Gathering in Mesh-based NoC for Deep Neural
Network Acceleration
- URL: http://arxiv.org/abs/2108.02569v1
- Date: Sun, 1 Aug 2021 23:50:12 GMT
- Title: Data Streaming and Traffic Gathering in Mesh-based NoC for Deep Neural
Network Acceleration
- Authors: Binayak Tiwari, Mei Yang, Xiaohang Wang, Yingtao Jiang
- Abstract summary: We propose a modified mesh architecture with a one-way/two-way streaming bus to speed up one-to-many traffic and the use of gather packets to support many-to-one traffic.
The analysis of the runtime latency of a convolutional layer shows that the two-way streaming architecture achieves a greater improvement than the one-way streaming architecture.
- Score: 7.455546102930911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing popularity of deep neural network (DNN) applications demands
high computing power and efficient hardware accelerator architecture. DNN
accelerators use a large number of processing elements (PEs) and on-chip memory
for storing weights and other parameters. As the communication backbone of a
DNN accelerator, networks-on-chip (NoC) play an important role in supporting
various dataflow patterns and enabling processing with communication
parallelism in a DNN accelerator. However, the widely used mesh-based NoC
architectures cannot efficiently support the one-to-many and
many-to-one traffic that is prevalent in DNN workloads. In this paper, we
propose a modified mesh architecture with a one-way/two-way streaming bus to
speed up one-to-many (multicast) traffic, and the use of gather packets to
support many-to-one (gather) traffic. The analysis of the runtime latency of a
convolutional layer shows that the two-way streaming architecture achieves
better improvement than the one-way streaming architecture for an Output
Stationary (OS) dataflow architecture. The simulation results demonstrate that
the gather packets can help to reduce the runtime latency by up to 1.8 times and
network power consumption by up to 1.7 times, compared with the repetitive unicast
method on modified mesh architectures supporting two-way streaming.
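The difference between repetitive unicast and gather packets can be illustrated with a simple counting sketch. This is not the paper's latency or power model: it assumes a hypothetical single row of PEs with one sink at the end of the row, counts only how many packets are injected and how many links they cross, and ignores flit sizes, router pipelines, and contention. The function names are made up for this illustration.

```python
def unicast_packet_hops(num_pes: int) -> int:
    """Total packet-hops when every PE sends its own packet to the sink.

    Assumption: PE i (0-indexed from the sink side) is i + 1 hops away.
    """
    return sum(i + 1 for i in range(num_pes))


def gather_packet_hops(num_pes: int) -> int:
    """Total packet-hops when a single gather packet starts at the farthest
    PE and picks up one payload at each PE it passes on the way to the sink.
    """
    return num_pes


def compare(num_pes: int) -> dict:
    """Summarize both schemes for one row of num_pes processing elements."""
    return {
        "unicast_packets": num_pes,
        "unicast_packet_hops": unicast_packet_hops(num_pes),
        "gather_packets": 1,
        "gather_packet_hops": gather_packet_hops(num_pes),
    }


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, compare(n))
```

Under these assumptions the unicast packet-hop count grows quadratically with the row length while the gather count grows linearly. The 1.8 times and 1.7 times figures in the abstract come from the authors' simulations and are not reproduced by this toy count.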
Related papers
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z)
- HYDRA: Hybrid Data Multiplexing and Run-time Layer Configurable DNN Accelerator [0.0]
The article proposes a layer-multiplexed approach, which further reuses a single activation function within the execution of a single layer with improved Fused-Multiply-Accumulate (FMA).
The proposed architectures achieve reductions of over 90% in power consumption, along with resource utilization improvements, with 35.21 TOPSW.
arXiv Detail & Related papers (2024-09-08T05:10:02Z)
- TrIM: Triangular Input Movement Systolic Array for Convolutional Neural Networks -- Part II: Architecture and Hardware Implementation [0.0]
TrIM is an innovative dataflow based on a triangular movement of inputs.
TrIM can reduce the number of memory accesses by one order of magnitude when compared to state-of-the-art systolic arrays.
The architecture achieves a peak throughput of 453.6 Giga Operations per Second.
arXiv Detail & Related papers (2024-08-05T10:18:00Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Core interface optimization for multi-core neuromorphic processors [5.391889175209394]
Spiking Neural Networks (SNNs) represent a promising approach to edge computing for applications that require low power and low latency.
To realize large-scale and scalable SNNs it is necessary to develop an efficient asynchronous communication and routing fabric.
arXiv Detail & Related papers (2023-08-08T10:00:14Z)
- Teal: Learning-Accelerated Optimization of WAN Traffic Engineering [68.7863363109948]
We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control.
To reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand.
Compared with other TE acceleration schemes, Teal satisfies 6--32% more traffic demand and yields 197--625x speedups.
arXiv Detail & Related papers (2022-10-25T04:46:30Z)
- Improving the Performance of a NoC-based CNN Accelerator with Gather Support [6.824747267214373]
Deep learning technology drives the need for an efficient parallel computing architecture for CNNs.
The CNN workload introduces many-to-one traffic in addition to one-to-one and one-to-many traffic.
We propose to use the gather packet on mesh-based NoCs employing an output-stationary systolic array in support of many-to-one traffic.
arXiv Detail & Related papers (2021-08-01T23:33:40Z)
- Spatio-temporal Modeling for Large-scale Vehicular Networks Using Graph Convolutional Networks [110.80088437391379]
A graph-based framework called SMART is proposed to model and keep track of the statistics of vehicle-to-infrastructure (V2I) communication latency across a large geographical area.
We develop a graph reconstruction-based approach using a graph convolutional network integrated with a deep Q-network algorithm.
Our results show that the proposed method can significantly improve both the accuracy and efficiency for modeling and the latency performance of large vehicular networks.
arXiv Detail & Related papers (2021-03-13T06:56:29Z)
- Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks [78.65792427542672]
Dynamic Graph Network (DG-Net) is a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent connection paths.
Instead of using the same path of the network, DG-Net aggregates features dynamically in each node, which allows the network to have more representation ability.
arXiv Detail & Related papers (2020-10-02T16:50:26Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.