On the Impact of Partial Sums on Interconnect Bandwidth and Memory
Accesses in a DNN Accelerator
- URL: http://arxiv.org/abs/2011.00850v1
- Date: Mon, 2 Nov 2020 09:44:50 GMT
- Title: On the Impact of Partial Sums on Interconnect Bandwidth and Memory
Accesses in a DNN Accelerator
- Authors: Mahesh Chandra
- Abstract summary: Dedicated accelerators are being designed to address the huge resource requirements of deep neural network (DNN) applications.
In this paper, we propose a first-order analytical method to partition the feature maps for optimal bandwidth.
It is shown that optimal partitioning and an active memory controller can achieve up to a 40% bandwidth reduction.
- Score: 5.429955391775968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dedicated accelerators are being designed to address the huge resource
requirements of deep neural network (DNN) applications. The power, performance
and area (PPA) constraints limit the number of MACs available in these
accelerators, so the convolution layers, which require a huge number of MAC
operations, are often partitioned into multiple iterative sub-tasks. This puts
heavy pressure on the available system resources such as interconnect and
memory bandwidth. Optimal partitioning of the feature maps for these sub-tasks
can reduce the bandwidth requirement substantially. Some accelerators avoid
off-chip or interconnect transfers by implementing local memories; however, the
memory accesses are still performed, and a reduced bandwidth can help save
power in such architectures. In this paper, we propose a first-order analytical
method to partition the feature maps for optimal bandwidth and evaluate the
impact of such partitioning on the bandwidth. This bandwidth can be saved by
designing an active memory controller which can perform basic arithmetic
operations. It is shown that optimal partitioning together with an active
memory controller can achieve up to a 40% bandwidth reduction.
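To make the bandwidth argument concrete, below is a minimal sketch (Python, not from the paper) of a first-order traffic model of this kind: a convolution layer is split into spatial tiles and input-channel groups, the interconnect traffic of every partition that fits an assumed on-chip buffer is estimated, and the partial-sum read-back term is dropped when an active memory controller is assumed to accumulate partial sums at the memory side. The traffic terms, the buffer constraint, the layer dimensions and all identifiers are illustrative assumptions, not the paper's exact formulation.

```python
# First-order interconnect-traffic model for a convolution layer that is
# partitioned into iterative sub-tasks.  A sketch under simplifying
# assumptions; it is not the paper's exact model.

from dataclasses import dataclass
from itertools import product

@dataclass
class Conv:
    H: int; W: int; C_in: int; C_out: int; K: int   # stride 1, 'same' padding

def ceil_div(a, b):
    return -(-a // b)

def traffic(conv, tiles_h, tiles_w, groups, active_mem):
    """Approximate element transfers over the interconnect for one layer.

    The output map is split into tiles_h x tiles_w spatial tiles and the input
    channels into `groups` groups, so every output element is built up from
    `groups` partial sums.  With an active memory controller the partial sums
    are only written (accumulation happens at the memory side); otherwise every
    pass after the first also reads the running sum back.
    """
    th, tw = ceil_div(conv.H, tiles_h), ceil_div(conv.W, tiles_w)
    n_tiles = tiles_h * tiles_w
    ifmap   = n_tiles * (th + conv.K - 1) * (tw + conv.K - 1) * conv.C_in
    weights = n_tiles * conv.K * conv.K * conv.C_in * conv.C_out  # re-read per tile
    outs    = conv.H * conv.W * conv.C_out
    psum    = groups * outs if active_mem else (2 * groups - 1) * outs
    return ifmap + weights + psum

def working_set(conv, tiles_h, tiles_w, groups):
    """On-chip storage (elements) one sub-task needs to be resident."""
    th, tw = ceil_div(conv.H, tiles_h), ceil_div(conv.W, tiles_w)
    cin = ceil_div(conv.C_in, groups)
    return ((th + conv.K - 1) * (tw + conv.K - 1) * cin      # input tile + halo
            + conv.K * conv.K * cin * conv.C_out             # weight slice
            + th * tw * conv.C_out)                          # output/psum tile

def best_partition(conv, buffer_elems, active_mem, max_split=16):
    """Brute-force the small partition space for the lowest-traffic choice."""
    feasible = (
        (traffic(conv, h, w, g, active_mem), (h, w, g))
        for h, w, g in product(range(1, max_split + 1), repeat=3)
        if working_set(conv, h, w, g) <= buffer_elems)
    return min(feasible)   # raises ValueError if nothing fits the buffer

if __name__ == "__main__":
    layer = Conv(H=56, W=56, C_in=256, C_out=256, K=3)   # illustrative layer
    buf = 128 * 1024                                      # 128K elements on chip
    base, part = best_partition(layer, buf, active_mem=False)
    amc, _     = best_partition(layer, buf, active_mem=True)
    print(f"best partition (tiles_h, tiles_w, cin_groups) = {part}")
    print(f"traffic: {base:,} elements; with active memory controller: "
          f"{amc:,} ({100 * (base - amc) / base:.1f}% less)")
```

The exhaustive search is only workable because the partition space here is tiny; the first-order analytical method described in the abstract would presumably replace it with closed-form expressions for the optimal split.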
Related papers
- COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators [6.172271429579593]
We propose a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators.
We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip.
arXiv Detail & Related papers (2025-01-12T11:31:25Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Fast Matrix Multiplications for Lookup Table-Quantized LLMs [58.11584672945781]
FLUTE is a flexible lookup table engine for LUT-quantized LLMs.
At batch sizes below 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
arXiv Detail & Related papers (2024-07-15T17:55:42Z) - A Configurable and Efficient Memory Hierarchy for Neural Network Hardware Accelerator [0.6242215470795112]
We propose a memory hierarchy framework tailored for the per-layer adaptive memory access patterns of deep neural networks (DNNs).
The objective is to strike an optimized balance between minimizing the required memory capacity and maintaining high accelerator performance.
arXiv Detail & Related papers (2024-04-24T11:57:37Z) - RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on
Edge [1.8293684411977293]
Deep neural network (DNN) based inference at the edge is challenging, as these compute- and data-intensive algorithms need to be implemented at low cost and low power.
We present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit sparsity to reduce area (storage), power, and latency.
arXiv Detail & Related papers (2023-06-10T17:25:58Z) - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z) - MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated
Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory, with a speedup of up to 2.78x under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z) - ATTACC the Quadratic Bottleneck of Attention Layers [3.2741800634280245]
This paper introduces a new attention-tailored dataflow, termed FLAT, for deep neural network (DNN) accelerators.
It increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer.
In our evaluation, ATTACC achieves 1.94x and 1.76x speedup and 49% and 42% energy reduction compared to state-of-the-art edge and cloud accelerators.
arXiv Detail & Related papers (2021-07-13T22:23:40Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z) - Caching Placement and Resource Allocation for Cache-Enabling UAV NOMA
Networks [87.6031308969681]
This article investigates cache-enabling unmanned aerial vehicle (UAV) cellular networks with massive access capability supported by non-orthogonal multiple access (NOMA).
We formulate the long-term caching placement and resource allocation optimization problem for content delivery delay minimization as a Markov decision process (MDP).
We propose a Q-learning based caching placement and resource allocation algorithm, where the UAV learns and selects actions with a soft $\varepsilon$-greedy strategy to search for the optimal match between actions and states.
arXiv Detail & Related papers (2020-08-12T08:33:51Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of
Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)