MAC-DO: An Efficient Output-Stationary GEMM Accelerator for CNNs Using
DRAM Technology
- URL: http://arxiv.org/abs/2207.07862v3
- Date: Wed, 7 Feb 2024 15:40:35 GMT
- Title: MAC-DO: An Efficient Output-Stationary GEMM Accelerator for CNNs Using
DRAM Technology
- Authors: Minki Jeong, Wanyeong Jung
- Abstract summary: This paper presents MAC-DO, an efficient and low-power DRAM-based in-situ accelerator.
It supports a multi-bit multiply-accumulate (MAC) operation within a single cycle.
A MAC-DO array can efficiently accelerate matrix multiplications based on output-stationary mapping, supporting the majority of computations performed in deep neural networks (DNNs).
- Score: 2.918940961856197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DRAM-based in-situ accelerators have shown their potential in addressing the
memory wall challenge of the traditional von Neumann architecture. Such
accelerators exploit charge sharing or logic circuits for simple logic
operations at the DRAM subarray level. However, their throughput is limited due
to low array utilization, as only a few row cells in a DRAM array participate
in operations while most rows remain deactivated. Moreover, they require many
cycles for more complex operations such as a multi-bit multiply-accumulate
(MAC) operation, resulting in significant data access and movement and
potentially worsening power efficiency. To overcome these limitations, this
paper presents MAC-DO, an efficient and low-power DRAM-based in-situ
accelerator. Compared to previous DRAM-based in-situ accelerators, a MAC-DO
cell, consisting of two 1T1C DRAM cells (two transistors and two capacitors),
innately supports a multi-bit MAC operation within a single cycle, ensuring
good linearity and compatibility with existing 1T1C DRAM cells and array
structures. This achievement is facilitated by a novel analog computation
method utilizing charge steering. Additionally, MAC-DO enables concurrent
individual MAC operations in each MAC-DO cell without idle cells, significantly
improving throughput and energy efficiency. As a result, a MAC-DO array can
efficiently accelerate matrix multiplications based on output-stationary
mapping, supporting the majority of computations performed in deep neural
networks (DNNs). Furthermore, a MAC-DO array efficiently reuses three types of
data (input, weight and output), minimizing data movement.
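As background, the output-stationary dataflow the abstract refers to can be modeled in a few lines of Python. The sketch below is illustrative only (all names are ours): each (i, j) position plays the role of one accumulator cell that keeps its partial sum in place while inputs and weights stream past, so every cell is busy every cycle. It models the dataflow, not the analog charge-steering circuit inside a MAC-DO cell.

```python
import numpy as np

def output_stationary_gemm(A, B):
    """Compute C = A @ B with an output-stationary schedule.

    Each (i, j) position models one stationary accumulator cell; in
    cycle k, row i of A and column j of B reach the cell, which
    performs one MAC on its resident partial sum.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))          # one stationary accumulator per output
    for k in range(K):            # one "cycle" per reduction step
        # every cell (i, j) sees A[i, k] and B[k, j] simultaneously,
        # so all M*N cells work in parallel -- no idle cells
        C += np.outer(A[:, k], B[k, :])
    return C

A = np.random.randn(4, 8)
B = np.random.randn(8, 3)
assert np.allclose(output_stationary_gemm(A, B), A @ B)
```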
Related papers
- BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network [55.21288428359509]
Existing 3D occupancy networks demand significant hardware resources, hindering deployment on edge devices.
We propose a novel binarized deep convolution (BDC) unit that effectively enhances performance while increasing the number of binarized convolutional layers.
Our BDC-Occ model is created by applying the proposed BDC unit to binarize the existing 3D occupancy networks.
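The summary does not detail the BDC unit itself; as generic background on what a binarized convolution computes, here is a minimal sketch (sign-binarized inputs and weights; the +1/-1 dot product is what hardware realizes as XNOR plus popcount). This is standard binarized convolution, not the proposed BDC unit.

```python
import numpy as np

def binarize(x):
    # sign binarization: values become +1 / -1 (zero mapped to +1)
    return np.where(x >= 0, 1, -1)

def binary_conv1d(x, w):
    """1-D convolution with binarized inputs and weights.

    The +1/-1 dot product is what a binarized accelerator implements
    as XNOR followed by a popcount; here we model it directly.
    """
    xb, wb = binarize(x), binarize(w)
    n, k = len(xb), len(wb)
    return np.array([np.dot(xb[i:i + k], wb) for i in range(n - k + 1)])

x = np.random.randn(16)
w = np.random.randn(3)
print(binary_conv1d(x, w))
```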
arXiv Detail & Related papers (2024-05-27T10:44:05Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and the feedforward layers achieves nearly matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
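A minimal sketch of the two-stage idea, under our own assumptions (a random-projection sketch stands in for HiRE's actual compression scheme, and the 4x over-selection factor is ours):

```python
import numpy as np

def approx_topk_matvec(W, x, k, sketch_dim=64, rng=np.random.default_rng(0)):
    """Two-stage approximate top-k (illustrative, not HiRE's exact scheme).

    Stage 1: a cheap low-rank sketch of W scores every row and predicts
    a candidate set larger than k (over-selection buys high recall).
    Stage 2: exact scores are computed only for the candidates.
    """
    S = rng.standard_normal((W.shape[1], sketch_dim)) / np.sqrt(sketch_dim)
    approx_scores = (W @ S) @ (S.T @ x)              # cheap prediction
    candidates = np.argsort(approx_scores)[-4 * k:]  # over-select for recall
    exact = W[candidates] @ x                        # full compute on subset only
    return candidates[np.argsort(exact)[-k:]]

W = np.random.randn(1000, 256)
x = np.random.randn(256)
print(approx_topk_matvec(W, x, k=8))
```

In practice the sketched matrix W @ S would be precomputed once, so only the small projections are done per query.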
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- A 137.5 TOPS/W SRAM Compute-in-Memory Macro with 9-b Memory Cell-Embedded ADCs and Signal Margin Enhancement Techniques for AI Edge Applications [20.74979295607707]
The CIM macro can perform 4x4-bit MAC operations and yields a 9-bit signed output.
The inherent discharge branches of the cells are utilized to apply time-modulated MAC and 9-bit ADC readout operations.
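As a quick sanity check on the stated widths (our own arithmetic, not from the paper): a signed 4-bit x 4-bit product spans [-56, 64], so a single product fits in 8 signed bits, and a 9-bit signed readout (range [-256, 255]) suggests the accumulated analog value is quantized by the ADC rather than carried at full precision.

```python
# Range check for signed 4-bit x 4-bit multiplication (illustrative):
prods = [a * b for a in range(-8, 8) for b in range(-8, 8)]
print(min(prods), max(prods))   # -56 64 -> one product fits in 8 signed bits
# Accumulating several such products exceeds 9 signed bits, so a 9-bit
# output implies the accumulated value is quantized at ADC readout.
```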
arXiv Detail & Related papers (2023-07-12T06:20:19Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) computes self-attention blockwise and fuses it with the feedforward network to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
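A minimal sketch of the blockwise idea: a streaming log-sum-exp softmax over key/value blocks means the full T x T attention matrix is never materialized. This illustrates blockwise attention generally, not BPT's exact fused kernel.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Memory-efficient attention, one key/value block at a time."""
    T, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(T, -np.inf)               # running max per query row
    l = np.zeros(T)                       # running softmax denominator
    out = np.zeros_like(V, dtype=float)
    for s in range(0, T, block):
        S = (Q @ K[s:s + block].T) * scale       # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)                # rescale old accumulators
        p = np.exp(S - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ V[s:s + block]
        m = m_new
    return out / l[:, None]

T, d = 256, 32
Q, K, V = (np.random.randn(T, d) for _ in range(3))
S = Q @ K.T / np.sqrt(d)                         # reference: full attention
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), ref)
```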
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is a recent multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy with over 80% reduction in computation, but poses challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
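As generic background on why MoE cuts computation (this sketches standard top-k routing, not Edge-MoE's hardware design): only k of the experts run per token, so with k=2 of 16 experts roughly 87% of the expert FLOPs are skipped.

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    """Sparse mixture-of-experts routing (generic illustration)."""
    logits = x @ gate_w                        # (tokens, num_experts)
    top = np.argsort(logits, axis=1)[:, -k:]   # chosen experts per token
    w = np.take_along_axis(logits, top, axis=1)
    w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)   # softmax over top-k
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):                     # evaluate only selected experts
            y[t] += w[t, j] * experts[top[t, j]](x[t])
    return y

d, n_exp = 32, 16
experts = [lambda v, W=np.random.randn(d, d): np.tanh(v @ W) for _ in range(n_exp)]
gate_w = np.random.randn(d, n_exp)
x = np.random.randn(8, d)
print(moe_layer(x, experts, gate_w).shape)     # (8, 32)
```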
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference [4.718504401468233]
PIM solutions rely either on novel memory technologies that have yet to mature, or on bit-serial computations that carry significant performance overhead and scalability issues.
Our work proposes an in-SRAM digital multiplier that uses conventional memory to perform bit-parallel computations, leveraging multiple-wordline activation.
We then introduce DAISM, an architecture leveraging this multiplier, which achieves up to two orders of magnitude higher area efficiency compared to state-of-the-art counterparts, with competitive energy efficiency.
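A software model of the bit-parallel idea, under our own naming (one wordline activation per multiplier bit yields all partial products at once; this is a generic shift-and-add model, not DAISM's circuit):

```python
def bit_parallel_multiply(a, b, bits=8):
    """Model of a bit-parallel multiplier: activating one wordline per
    multiplier bit produces all partial products (a AND b_i) in parallel;
    summing the shifted partial products gives the full product."""
    partial = [(a if (b >> i) & 1 else 0) << i for i in range(bits)]
    return sum(partial)

assert bit_parallel_multiply(13, 11) == 143
```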
arXiv Detail & Related papers (2023-05-12T10:58:21Z)
- A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface [16.228299091691873]
Computing-in-memory (CiM) is a promising mitigation approach that enables multiply-accumulate operations within the memory.
This work achieves 51.2 GOPS throughput and 10.3 TOPS/W energy efficiency, while showing 88.6% accuracy on the CIFAR-10 dataset.
arXiv Detail & Related papers (2022-11-23T07:52:10Z)
- NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators [12.045126404373868]
Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads.
NEON is a novel compiler optimization that enables end-to-end execution of NN workloads in RRAM.
arXiv Detail & Related papers (2022-11-10T17:57:35Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
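A toy sketch of the block-recurrent pattern (our own simplified state update; the paper's cell uses attention between each block and the recurrent state):

```python
import numpy as np

def block_recurrent_pass(tokens, block=16, d=32, rng=np.random.default_rng(0)):
    """Carry a recurrent state across blocks of tokens, while work
    inside each block stays fully parallel (illustrative shapes only)."""
    Wx = rng.standard_normal((d, d)) / np.sqrt(d)
    Ws = rng.standard_normal((d, d)) / np.sqrt(d)
    state = np.zeros(d)                        # recurrent state across blocks
    outputs = []
    for s in range(0, len(tokens), block):
        blk = tokens[s:s + block]              # (block, d): parallel within block
        out = np.tanh(blk @ Wx + state @ Ws)   # every token sees the same state
        state = out.mean(axis=0)               # summarize block into next state
        outputs.append(out)
    return np.concatenate(outputs, axis=0)

tokens = np.random.randn(64, 32)
print(block_recurrent_pass(tokens).shape)      # (64, 32)
```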
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- ATRIA: A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing [0.5257115841810257]
ATRIA is a novel bit-pArallel sTochastic aRithmetic based In-DRAM Accelerator for high-speed inference of CNNs.
We show that ATRIA exhibits only a 3.5% drop in CNN inference accuracy while still achieving improvements of up to 3.2x in frames-per-second (FPS) and up to 10x in efficiency.
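As background on the stochastic-arithmetic idea ATRIA builds on (a generic sketch, not ATRIA's bit-parallel in-DRAM scheme): encoding values as random bitstreams reduces multiplication to a single AND gate, at the cost of stochastic error.

```python
import numpy as np

def stochastic_multiply(a, b, n_bits=4096, rng=np.random.default_rng(0)):
    """Stochastic-computing multiplication for values in [0, 1]:
    a bitstream's mean encodes the value, and the bitwise AND of two
    independent streams has mean a * b."""
    sa = rng.random(n_bits) < a     # bitstream encoding of a
    sb = rng.random(n_bits) < b     # independent bitstream for b
    return np.mean(sa & sb)         # AND-gate product estimate

print(stochastic_multiply(0.5, 0.8))   # approx 0.4, with stochastic error
```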
arXiv Detail & Related papers (2021-05-26T18:36:01Z)
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, which typically requires external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reductions in storage and training energy, respectively, with negligible accuracy loss compared to state-of-the-art training baselines.
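A hedged illustration of the storage-for-compute trade (a plain truncated-SVD factorization with quantized coefficients, standing in for SmartDeal's actual decomposition): store a small basis plus cheap coefficients instead of the full weight matrix, and rebuild the weights on the fly with extra multiplies.

```python
import numpy as np

def remodel_weights(W, rank=8):
    """Store a small basis plus quantized per-row coefficients instead of
    the full weight matrix (illustration, not SmartDeal's exact method)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    basis = Vt[:rank]                      # small, cheap to store
    coeff = U[:, :rank] * s[:rank]         # per-row coefficients
    coeff_q = np.round(coeff * 16) / 16    # aggressive quantization
    return coeff_q, basis                  # stored form

def forward(x, coeff_q, basis):
    # recompute the effective weights with extra (cheap) computation
    return x @ (coeff_q @ basis).T

W = np.random.randn(64, 64)
coeff_q, basis = remodel_weights(W)
x = np.random.randn(4, 64)
print(forward(x, coeff_q, basis).shape)    # (4, 64)
```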
arXiv Detail & Related papers (2021-01-04T18:54:07Z)