Data Streaming and Traffic Gathering in Mesh-based NoC for Deep Neural
Network Acceleration
- URL: http://arxiv.org/abs/2108.02569v1
- Date: Sun, 1 Aug 2021 23:50:12 GMT
- Title: Data Streaming and Traffic Gathering in Mesh-based NoC for Deep Neural
Network Acceleration
- Authors: Binayak Tiwari, Mei Yang, Xiaohang Wang, Yingtao Jiang
- Abstract summary: We propose a modified mesh architecture with a one-way/two-way streaming bus to speed up one-to-many traffic and the use of gather packets to support many-to-one traffic.
The analysis of runtime latency of a convolutional layer shows that the two-way streaming architecture achieves better improvement than the one-way streaming architecture.
- Score: 7.455546102930911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing popularity of deep neural network (DNN) applications demands
high computing power and efficient hardware accelerator architecture. DNN
accelerators use a large number of processing elements (PEs) and on-chip memory
for storing weights and other parameters. As the communication backbone of a
DNN accelerator, networks-on-chip (NoC) play an important role in supporting
various dataflow patterns and enabling processing with communication
parallelism in a DNN accelerator. However, the widely used mesh-based NoC
architectures inherently cannot support the efficient one-to-many and
many-to-one traffic largely existing in DNN workloads. In this paper, we
propose a modified mesh architecture with a one-way/two-way streaming bus to
speed up one-to-many (multicast) traffic, and the use of gather packets to
support many-to-one (gather) traffic. The analysis of the runtime latency of a
convolutional layer shows that the two-way streaming architecture achieves
better improvement than the one-way streaming architecture for an Output
Stationary (OS) dataflow architecture. The simulation results demonstrate that
the gather packets reduce the runtime latency by up to 1.8 times and network
power consumption by up to 1.7 times, compared with the repetitive unicast
method on modified mesh architectures supporting two-way streaming.
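The gather-versus-unicast trade-off can be illustrated with a back-of-envelope hop-count model: on a k x k mesh with XY routing, repetitive unicast sends one packet per PE to the sink, while a single gather packet traverses the PEs and collects payloads along the way. The traversal order and function names below are illustrative assumptions, not the paper's cycle-accurate simulation.

```python
def xy_hops(src, dst):
    """Manhattan distance of XY routing between two mesh coordinates."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def unicast_hops(k, sink):
    """Total hops when every PE sends its own packet to the sink."""
    return sum(xy_hops((x, y), sink)
               for x in range(k) for y in range(k) if (x, y) != sink)

def gather_hops(k, sink):
    """Hops of one gather packet that snakes through all PEs
    (boustrophedon order) and finally returns to the sink."""
    path = [(x, y if x % 2 == 0 else k - 1 - y)
            for x in range(k) for y in range(k)]
    hops = sum(xy_hops(a, b) for a, b in zip(path, path[1:]))
    return hops + xy_hops(path[-1], sink)

k, sink = 4, (0, 0)
print(unicast_hops(k, sink), gather_hops(k, sink))
```

Even this crude model shows why a single collecting packet beats per-PE unicast for many-to-one traffic; the actual savings reported above also account for packetization, contention, and router power.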
Related papers
- LoAS: Fully Temporal-Parallel Dataflow for Dual-Sparse Spiking Neural Networks [14.844751188874652]
Spiking Neural Networks (SNNs) have gained significant research attention in the last decade due to their potential to drive resource-constrained edge devices.
Existing SNN accelerators offer high efficiency in processing sparse spikes with dense weights, but opportunities are less explored in SNNs with sparse weights.
We study the acceleration of dual-sparse SNNs, focusing on their core operation, sparse-matrix-sparse-matrix multiplication (spMspM).
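As a minimal sketch of the spMspM operation the paper accelerates, the snippet below multiplies two sparse matrices stored as {row: {col: value}} maps, a stand-in for CSR; this illustrates the operation only, not LoAS's temporal-parallel dataflow.

```python
def spmspm(a, b):
    """Row-wise sparse product: each nonzero a[i][k] scales row k of b."""
    out = {}
    for i, row in a.items():
        acc = {}
        for k, v in row.items():
            for j, w in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + v * w
        if acc:
            out[i] = acc
    return out

a = {0: {1: 2}, 2: {0: 3}}          # nonzeros: a[0][1]=2, a[2][0]=3
b = {0: {2: 4}, 1: {0: 5, 2: 6}}    # nonzeros in rows 0 and 1 of b
print(spmspm(a, b))
```

Note that the inner loops only ever touch nonzero entries, which is exactly the property dual-sparse accelerators exploit.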
arXiv Detail & Related papers (2024-07-19T07:02:26Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
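A toy version of the CWT step is sketched below: correlating a 1-D signal with a wavelet at several scales yields a 2-D scales-by-time tensor. The Ricker ("Mexican hat") wavelet and the scale set are illustrative choices for the sketch, not the paper's actual configuration.

```python
import math

def ricker(t, s):
    """Ricker wavelet of width s evaluated at offset t."""
    a = (t / s) ** 2
    return (2 / (math.sqrt(3 * s) * math.pi ** 0.25)) * (1 - a) * math.exp(-a / 2)

def cwt(signal, scales):
    """Correlate the signal with the wavelet at every scale and shift."""
    n = len(signal)
    return [[sum(signal[t] * ricker(t - tau, s) for t in range(n))
             for tau in range(n)]
            for s in scales]

signal = [math.sin(0.5 * t) for t in range(32)]
tensor = cwt(signal, scales=[1, 2, 4, 8])
print(len(tensor), len(tensor[0]))   # scales x time steps
```

The resulting 2-D tensor can then be fed to convolutional layers like an image, which is what makes the temporal-frequency stream convenient.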
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- SMOF: Streaming Modern CNNs on FPGAs with Smart Off-Chip Eviction [6.800641017055453]
This paper introduces weight and activation eviction mechanisms to off-chip memory along the computational pipeline.
The proposed mechanism is incorporated into an existing toolflow, expanding the design space by utilising off-chip memory as a buffer.
SMOF has demonstrated the capacity to deliver competitive and, in some cases, state-of-the-art performance across a spectrum of computer vision tasks.
arXiv Detail & Related papers (2024-03-27T18:12:24Z)
- Core interface optimization for multi-core neuromorphic processors [5.391889175209394]
Spiking Neural Networks (SNNs) represent a promising approach to edge-computing for applications that require low-power and low-latency.
To realize large-scale and scalable SNNs it is necessary to develop an efficient asynchronous communication and routing fabric.
arXiv Detail & Related papers (2023-08-08T10:00:14Z)
- Teal: Learning-Accelerated Optimization of WAN Traffic Engineering [68.7863363109948]
We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control.
To reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand.
Compared with other TE acceleration schemes, Teal satisfies 6--32% more traffic demand and yields 197--625x speedups.
arXiv Detail & Related papers (2022-10-25T04:46:30Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- Improving the Performance of a NoC-based CNN Accelerator with Gather Support [6.824747267214373]
Deep learning technology drives the need for an efficient parallel computing architecture for CNNs.
The CNN workload introduces many-to-one traffic in addition to one-to-one and one-to-many traffic.
We propose to use the gather packet on mesh-based NoCs employing output stationary systolic array in support of many-to-one traffic.
arXiv Detail & Related papers (2021-08-01T23:33:40Z)
- Spatio-temporal Modeling for Large-scale Vehicular Networks Using Graph Convolutional Networks [110.80088437391379]
A graph-based framework called SMART is proposed to model and track the statistics of vehicle-to-infrastructure (V2I) communication latency across a large geographical area.
We develop a graph reconstruction-based approach using a graph convolutional network integrated with a deep Q-networks algorithm.
Our results show that the proposed method significantly improves both modeling accuracy and efficiency, as well as the latency performance of large vehicular networks.
arXiv Detail & Related papers (2021-03-13T06:56:29Z)
- Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks [78.65792427542672]
Dynamic Graph Network (DG-Net) is a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent connection paths.
Instead of using the same path of the network, DG-Net aggregates features dynamically in each node, which allows the network to have more representation ability.
arXiv Detail & Related papers (2020-10-02T16:50:26Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
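Pattern-based pruning can be sketched in a few lines: each 3x3 kernel keeps only the positions allowed by whichever fixed pattern preserves the most weight magnitude. The pattern set below is made up for illustration and is not PatDNN's actual pattern library.

```python
# Two hypothetical 3x3 sparsity patterns, each listing the (row, col)
# positions a kernel is allowed to keep.
PATTERNS = [
    [(0, 1), (1, 0), (1, 1), (1, 2)],   # cross-like pattern
    [(0, 0), (1, 1), (2, 2), (0, 2)],   # diagonal-like pattern
]

def prune_kernel(kernel):
    """Pick the pattern preserving the most weight magnitude, zero the rest."""
    best = max(PATTERNS,
               key=lambda p: sum(abs(kernel[r][c]) for r, c in p))
    return [[kernel[r][c] if (r, c) in best else 0.0
             for c in range(3)] for r in range(3)]

k = [[0.1, 0.9, 0.2],
     [0.8, 1.0, 0.7],
     [0.3, 0.2, 0.1]]
print(prune_kernel(k))
```

Because every kernel ends up matching one of a handful of known shapes, a compiler can specialize the inner loops per pattern, which is the hardware-efficiency insight the summary refers to.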
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.