ATRIA: A Bit-Parallel Stochastic Arithmetic Based Accelerator for
In-DRAM CNN Processing
- URL: http://arxiv.org/abs/2105.12781v1
- Date: Wed, 26 May 2021 18:36:01 GMT
- Title: ATRIA: A Bit-Parallel Stochastic Arithmetic Based Accelerator for
In-DRAM CNN Processing
- Authors: Supreeth Mysore Shivanandamurthy, Ishan G. Thakkar, Sayed Ahmad Salehi
- Abstract summary: ATRIA is a novel bit-pArallel sTochastic aRithmetic based In-DRAM Accelerator for high-speed inference of CNNs.
We show that ATRIA exhibits only a 3.5% drop in CNN inference accuracy while still achieving improvements of up to 3.2x in frames-per-second (FPS) and up to 10x in efficiency.
- Score: 0.5257115841810257
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: With the rapidly growing use of Convolutional Neural Networks (CNNs) in
real-world applications related to machine learning and Artificial Intelligence
(AI), several hardware accelerator designs for CNN inference and training have
been proposed recently. In this paper, we present ATRIA, a novel bit-pArallel
sTochastic aRithmetic based In-DRAM Accelerator for energy-efficient and
high-speed inference of CNNs. ATRIA employs light-weight modifications in DRAM
cell arrays to implement bit-parallel stochastic arithmetic based acceleration
of multiply-accumulate (MAC) operations inside DRAM. ATRIA significantly
improves the latency, throughput, and efficiency of processing CNN inferences
by performing 16 MAC operations in only five consecutive memory operation
cycles. We mapped the inference tasks of four benchmark CNNs on ATRIA to
compare its performance with five state-of-the-art in-DRAM CNN accelerators
from prior work. The results of our analysis show that ATRIA exhibits only a 3.5% drop in CNN inference accuracy and still achieves improvements of up to 3.2x in
frames-per-second (FPS) and up to 10x in efficiency (FPS/W/mm2), compared to
the best-performing in-DRAM accelerator from prior work.
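As a rough illustration of the arithmetic ATRIA accelerates, the sketch below emulates a unipolar stochastic-computing MAC in software: values in [0, 1] are encoded as random bitstreams, multiplication becomes a bitwise AND, and accumulation becomes a popcount. The bitstream length, unipolar encoding, and NumPy emulation are illustrative assumptions and do not model ATRIA's bit-parallel in-DRAM circuits or its five-cycle schedule.

```python
import numpy as np

def to_bitstream(x, length, rng):
    """Encode a value x in [0, 1] as a unipolar stochastic bitstream:
    each bit is 1 with probability x, so the stream's mean approximates x."""
    return (rng.random(length) < x).astype(np.uint8)

def stochastic_mac(weights, activations, length=256, seed=0):
    """Approximate sum_i w_i * a_i with stochastic arithmetic (software emulation):
    each product is the bitwise AND of two independent bitstreams (P(1) ~= w_i * a_i),
    and the accumulation is a popcount over all product streams."""
    rng = np.random.default_rng(seed)
    total_ones = 0
    for w, a in zip(weights, activations):
        w_bits = to_bitstream(w, length, rng)
        a_bits = to_bitstream(a, length, rng)
        total_ones += np.count_nonzero(w_bits & a_bits)  # AND acts as the multiplier
    return total_ones / length  # scale the popcount back to a real-valued estimate

# 16 weight/activation pairs, mirroring the 16-MAC granularity mentioned in the abstract
rng = np.random.default_rng(42)
w, a = rng.random(16) * 0.5, rng.random(16) * 0.5
print("stochastic MAC estimate:", stochastic_mac(w, a))
print("exact dot product:      ", float(np.dot(w, a)))
```

Longer bitstreams trade latency for accuracy; this is the same accuracy/efficiency trade-off the abstract quantifies as a 3.5% accuracy drop against up to 3.2x FPS and up to 10x efficiency gains.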
Related papers
- PENDRAM: Enabling High-Performance and Energy-Efficient Processing of Deep Neural Networks through a Generalized DRAM Data Mapping Policy [6.85785397160228]
Convolutional Neural Networks (CNNs) have emerged as a state-of-the-art solution for solving machine learning tasks.
CNN accelerators face performance- and energy-efficiency challenges due to high off-chip memory (DRAM) access latency and energy.
We present PENDRAM, a novel design space exploration methodology that enables high-performance and energy-efficient CNN acceleration.
arXiv Detail & Related papers (2024-08-05T12:11:09Z) - Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator
for Vision Applications [108.44482683870888]
We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications.
DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements.
It demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation.
arXiv Detail & Related papers (2024-01-11T14:53:24Z) - Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z) - Performance Analysis of DNN Inference/Training with Convolution and
non-Convolution Operations [5.647410731290209]
This work proposes a novel performance analysis framework, SimDIT, for general ASIC-based systolic hardware accelerator platforms.
SimDIT comprehensively covers convolution and non-convolution operations of both CNN inference and training.
SimDIT achieves 18X performance improvement over a generic static resource allocation for ResNet-50 inference.
arXiv Detail & Related papers (2023-06-29T08:11:36Z) - Accelerating Neural Network Inference with Processing-in-DRAM: From the
Edge to the Cloud [9.927754948343326]
A neural network's performance (and energy efficiency) can be bound either by computation or memory resources.
The processing-in-memory (PIM) paradigm is a viable solution to accelerate memory-bound NNs.
We analyze three state-of-the-art PIM architectures for NN performance and energy efficiency.
arXiv Detail & Related papers (2022-09-19T11:46:05Z) - MAC-DO: An Efficient Output-Stationary GEMM Accelerator for CNNs Using
DRAM Technology [2.918940961856197]
This paper presents MAC-DO, an efficient and low-power DRAM-based in-situ accelerator.
It supports a multi-bit multiply-accumulate (MAC) operation within a single cycle.
A MAC-DO array can efficiently accelerate matrix multiplications based on output-stationary mapping, supporting the majority of computations performed in deep neural networks (DNNs).
arXiv Detail & Related papers (2022-07-16T07:33:20Z) - Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of
Peripherals [11.31429464715989]
This paper presents a new PIM architecture to efficiently accelerate deep learning tasks.
It minimizes the required A/D conversions using analog accumulation and neural-approximated peripheral circuits.
Evaluations on different benchmarks demonstrate that Neural-PIM can improve energy efficiency by 5.36x (1.73x) and speed up throughput by 3.43x (1.59x) without losing accuracy.
arXiv Detail & Related papers (2022-01-30T16:14:49Z) - FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around 40% of the available hardware resources.
It reduces classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its software, full-precision counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z) - Accelerating Training and Inference of Graph Neural Networks with Fast
Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and
Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++), which input-dependently adjust the filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z) - RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks
on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobile devices.
arXiv Detail & Related papers (2020-07-20T02:05:32Z)