RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on
Edge
- URL: http://arxiv.org/abs/2306.06493v1
- Date: Sat, 10 Jun 2023 17:25:58 GMT
- Title: RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on
Edge
- Authors: Adithya Krishna, Srikanth Rohit Nudurupati, Chandana D G, Pritesh
Dwivedi, André van Schaik, Mahesh Mehendale and Chetan Singh Thakur
- Abstract summary: Deep Neural Network (DNN) based inference at the edge is challenging as these compute and data-intensive algorithms need to be implemented at low cost and low power.
We present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit sparsity to reduce area (storage), power, and latency.
- Score: 1.8293684411977293
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Deep Neural Network (DNN) based inference at the edge is challenging as these
compute and data-intensive algorithms need to be implemented at low cost and
low power while meeting the latency constraints of the target applications.
Sparsity in both activations and weights, inherent to DNNs, is a key knob to leverage. In this paper, we present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit sparsity to reduce area (storage), power, and latency. RAMAN can be configured to
support a wide range of DNN topologies - consisting of different convolution
layer types and a range of layer parameters (feature-map size and the number of
channels). RAMAN can also be configured to support accuracy vs power/latency
tradeoffs using techniques deployed at compile-time and run-time. We present
the salient features of the architecture, provide implementation results and
compare it with the state-of-the-art. RAMAN employs a novel dataflow inspired by Gustavson's algorithm with optimal input activation (IA) and output activation (OA) reuse to minimize memory access and the overall data
movement cost. The dataflow allows RAMAN to locally reduce the partial sum
(Psum) within a processing element array to eliminate the Psum writeback
traffic. Additionally, we suggest a method to reduce peak activation memory by
overlapping IA and OA on the same memory space, which can reduce storage
requirements by up to 50%. RAMAN was implemented on a low-power and
resource-constrained Efinix Ti60 FPGA, utilizing 37.2K LUTs and 8.6K registers. RAMAN processes all layers of the MobileNetV1 model at 98.47
GOp/s/W and the DS-CNN model at 79.68 GOp/s/W by leveraging both weight and
activation sparsity.
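To make the dataflow claim concrete, the sketch below shows a Gustavson-style (row-wise) sparse product OA = W x IA in Python. It is a minimal illustration under assumed shapes and skipping rules, not RAMAN's RTL or exact loop nest (the lowering of convolution layers to matrices, the PE-array tiling, and the compressed storage format are not specified in the abstract): each OA row is accumulated in a local buffer and written back exactly once, which is the Psum-locality property, while zero weights and all-zero IA rows are skipped to exploit both forms of sparsity.

```python
# Illustrative sketch only (assumed mapping, not RAMAN's implementation):
# Gustavson-style row-wise sparse product OA = W x IA. The partial sum (Psum)
# for one output-activation (OA) row stays local and is written back once.
import numpy as np

def gustavson_spmm(W, IA):
    """OA[i, :] = sum_k W[i, k] * IA[k, :], skipping zeros in both operands."""
    M, K = W.shape
    K_ia, N = IA.shape
    assert K == K_ia
    OA = np.zeros((M, N))
    for i in range(M):                    # produce one OA row at a time
        psum = np.zeros(N)                # Psum held locally ("inside the PE array")
        for k in np.nonzero(W[i])[0]:     # skip zero weights (weight sparsity)
            ia_row = IA[k]
            if ia_row.any():              # skip all-zero IA rows (activation sparsity)
                psum += W[i, k] * ia_row  # scale-and-accumulate update
        OA[i] = psum                      # single writeback -> no Psum traffic
    return OA

# Sanity check against a dense reference
rng = np.random.default_rng(0)
W = rng.random((8, 16)) * (rng.random((8, 16)) > 0.7)     # ~70% zero weights
IA = rng.random((16, 32)) * (rng.random((16, 32)) > 0.5)  # ~50% zero activations
assert np.allclose(gustavson_spmm(W, IA), W @ IA)
```

The abstract's second memory optimization, overlapping IA and OA in the same buffer to cut peak activation storage by up to 50%, is a separate buffer-management technique and is not captured by this sketch.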
Related papers
- RNC: Efficient RRAM-aware NAS and Compilation for DNNs on Resource-Constrained Edge Devices [0.30458577208819987]
We aim to develop edge-friendly deep neural networks (DNNs) for accelerators based on resistive random-access memory (RRAM).
We propose an edge compilation and resource-constrained RRAM-aware neural architecture search (NAS) framework to search for optimized neural networks meeting specific hardware constraints.
The speed-optimized model found by the NAS achieved a 5x-30x speedup.
arXiv Detail & Related papers (2024-09-27T15:35:36Z) - A Configurable and Efficient Memory Hierarchy for Neural Network Hardware Accelerator [0.6242215470795112]
We propose a memory hierarchy framework tailored to the per-layer adaptive memory access patterns of deep neural networks (DNNs).
The objective is to strike an optimized balance between minimizing the required memory capacity and maintaining high accelerator performance.
arXiv Detail & Related papers (2024-04-24T11:57:37Z) - Spiker+: a framework for the generation of efficient Spiking Neural
Networks FPGA accelerators for inference at the edge [49.42371633618761]
Spiker+ is a framework for generating efficient, low-power, and low-area customized Spiking Neural Networks (SNN) accelerators on FPGA for inference at the edge.
Spiker+ is tested on two benchmark datasets, MNIST and the Spiking Heidelberg Digits (SHD).
arXiv Detail & Related papers (2024-01-02T10:42:42Z) - Energy-efficient Task Adaptation for NLP Edge Inference Leveraging
Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z) - Efficient Hardware Acceleration of Sparsely Active Convolutional Spiking
Neural Networks [0.0]
Spiking Neural Networks (SNNs) compute in an event-based manner to achieve more efficient computation than standard Neural Networks.
We propose a novel architecture that is optimized for the processing of Convolutional SNNs that feature a high degree of activation sparsity.
arXiv Detail & Related papers (2022-03-23T14:18:58Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
With a latency- and accuracy-aware reward design, the method adapts well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - MAFAT: Memory-Aware Fusing and Tiling of Neural Networks for Accelerated
Edge Inference [1.7894377200944507]
Machine learning networks can easily exceed available memory, increasing latency due to excessive OS swapping.
We propose a memory usage predictor coupled with a search algorithm to provide optimized fusing and tiling configurations.
Results show that our approach can run in less than half the memory and with a speedup of up to 2.78x under severe memory constraints.
arXiv Detail & Related papers (2021-07-14T19:45:49Z) - Efficient Micro-Structured Weight Unification and Pruning for Neural
Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices.
Previous unstructured or structured weight pruning methods can hardly achieve true inference acceleration.
We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z) - Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration [14.958793135751149]
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM).
Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead.
We address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware.
arXiv Detail & Related papers (2020-09-04T20:17:42Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to regain and guarantee high hardware efficiency (a minimal sketch of pattern-based kernel pruning follows this list).
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
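The pattern-pruning idea in the PatDNN entry above is easy to illustrate. The sketch below is a minimal, hedged version of pattern-based kernel pruning: the four candidate patterns, the magnitude-based selection rule, and the function names are assumptions made for illustration, not PatDNN's exact design. It only shows how fixing a small set of fine-grained patterns inside each coarse-grained 3x3 kernel yields the regular structure a compiler can later exploit.

```python
# Hedged illustration of pattern-based kernel pruning (hypothetical patterns,
# not PatDNN's actual pattern set or selection algorithm). Each 3x3 kernel keeps
# only the weights at the positions of one pattern, chosen to preserve the most
# weight magnitude.
import numpy as np

# Four hypothetical 4-entry patterns over a 3x3 kernel (1 = keep, 0 = prune).
PATTERNS = np.array([
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
], dtype=float)

def pattern_prune(weights):
    """weights: (out_ch, in_ch, 3, 3) conv filter bank.
    Returns the pruned copy and the chosen pattern index per kernel."""
    pruned = np.empty_like(weights)
    choice = np.empty(weights.shape[:2], dtype=int)
    for o in range(weights.shape[0]):
        for i in range(weights.shape[1]):
            kernel = weights[o, i]
            # score each pattern by the weight magnitude it preserves
            scores = [np.abs(kernel * p).sum() for p in PATTERNS]
            best = int(np.argmax(scores))
            choice[o, i] = best
            pruned[o, i] = kernel * PATTERNS[best]
    return pruned, choice

# Example: prune a random 4x3x3x3 filter bank
w = np.random.default_rng(1).standard_normal((4, 3, 3, 3))
pruned_w, pattern_ids = pattern_prune(w)
```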