MAC-DO: An Efficient Output-Stationary GEMM Accelerator for CNNs Using
DRAM Technology
- URL: http://arxiv.org/abs/2207.07862v3
- Date: Wed, 7 Feb 2024 15:40:35 GMT
- Title: MAC-DO: An Efficient Output-Stationary GEMM Accelerator for CNNs Using
DRAM Technology
- Authors: Minki Jeong, Wanyeong Jung
- Abstract summary: This paper presents MAC-DO, an efficient and low-power DRAM-based in-situ accelerator.
It supports a multi-bit multiply-accumulate (MAC) operation within a single cycle.
A MAC-DO array can efficiently accelerate matrix multiplications based on output-stationary mapping, supporting the majority of computations performed in deep neural networks (DNNs).
- Score: 2.918940961856197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DRAM-based in-situ accelerators have shown their potential in addressing the
memory wall challenge of the traditional von Neumann architecture. Such
accelerators exploit charge sharing or logic circuits for simple logic
operations at the DRAM subarray level. However, their throughput is limited due
to low array utilization, as only a few row cells in a DRAM array participate
in operations while most rows remain deactivated. Moreover, they require many
cycles for more complex operations such as a multi-bit multiply-accumulate
(MAC) operation, resulting in significant data access and movement and
potentially worsening power efficiency. To overcome these limitations, this
paper presents MAC-DO, an efficient and low-power DRAM-based in-situ
accelerator. Compared to previous DRAM-based in-situ accelerators, a MAC-DO
cell, consisting of two 1T1C DRAM cells (two transistors and two capacitors),
innately supports a multi-bit MAC operation within a single cycle, ensuring
good linearity and compatibility with existing 1T1C DRAM cells and array
structures. This achievement is facilitated by a novel analog computation
method utilizing charge steering. Additionally, MAC-DO enables concurrent
individual MAC operations in each MAC-DO cell without idle cells, significantly
improving throughput and energy efficiency. As a result, a MAC-DO array can
efficiently accelerate matrix multiplications based on output-stationary
mapping, supporting the majority of computations performed in deep neural
networks (DNNs). Furthermore, a MAC-DO array efficiently reuses three types of
data (input, weight and output), minimizing data movement.
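As background, the output-stationary dataflow the abstract refers to can be modeled in a few lines of Python. The sketch below is illustrative only (all names are ours): each (i, j) position plays the role of one accumulator cell that keeps its partial sum in place while inputs and weights stream past, so every cell is busy every cycle. It models the dataflow, not the analog charge-steering circuit inside a MAC-DO cell.

```python
import numpy as np

def output_stationary_gemm(A, B):
    """Compute C = A @ B with an output-stationary schedule.

    Each (i, j) position models one stationary accumulator cell; in
    cycle k, row i of A and column j of B reach the cell, which
    performs one MAC on its resident partial sum.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))          # one stationary accumulator per output
    for k in range(K):            # one "cycle" per reduction step
        # every cell (i, j) sees A[i, k] and B[k, j] simultaneously,
        # so all M*N cells work in parallel -- no idle cells
        C += np.outer(A[:, k], B[k, :])
    return C

A = np.random.randn(4, 8)
B = np.random.randn(8, 3)
assert np.allclose(output_stationary_gemm(A, B), A @ B)
```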
Related papers
- BDC-Occ: Binarized Deep Convolution Unit For Binarized Occupancy Network [55.21288428359509]
Existing 3D occupancy networks demand significant hardware resources, hindering deployment on edge devices.
We propose a novel binarized deep convolution (BDC) unit that effectively enhances performance while increasing the number of binarized convolutional layers.
Our BDC-Occ model is created by applying the proposed BDC unit to binarize the existing 3D occupancy networks.
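The summary does not detail the BDC unit itself; as generic background on what a binarized convolution computes, here is a minimal sketch (sign-binarized inputs and weights; the +1/-1 dot product is what hardware realizes as XNOR plus popcount). This is standard binarized convolution, not the proposed BDC unit.

```python
import numpy as np

def binarize(x):
    # sign binarization: values become +1 / -1 (zero mapped to +1)
    return np.where(x >= 0, 1, -1)

def binary_conv1d(x, w):
    """1-D convolution with binarized inputs and weights.

    The +1/-1 dot product is what a binarized accelerator implements
    as XNOR followed by a popcount; here we model it directly.
    """
    xb, wb = binarize(x), binarize(w)
    n, k = len(xb), len(wb)
    return np.array([np.dot(xb[i:i + k], wb) for i in range(n - k + 1)])

x = np.random.randn(16)
w = np.random.randn(3)
print(binary_conv1d(x, w))
```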
arXiv Detail & Related papers (2024-05-27T10:44:05Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and the feedforward layers achieves nearly matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
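A minimal sketch of the two-stage idea, under our own assumptions (a random-projection sketch stands in for HiRE's actual compression scheme, and the 4x over-selection factor is ours):

```python
import numpy as np

def approx_topk_matvec(W, x, k, sketch_dim=64, rng=np.random.default_rng(0)):
    """Two-stage approximate top-k (illustrative, not HiRE's exact scheme).

    Stage 1: a cheap low-rank sketch of W scores every row and predicts
    a candidate set larger than k (over-selection buys high recall).
    Stage 2: exact scores are computed only for the candidates.
    """
    S = rng.standard_normal((W.shape[1], sketch_dim)) / np.sqrt(sketch_dim)
    approx_scores = (W @ S) @ (S.T @ x)              # cheap prediction
    candidates = np.argsort(approx_scores)[-4 * k:]  # over-select for recall
    exact = W[candidates] @ x                        # full compute on subset only
    return candidates[np.argsort(exact)[-k:]]

W = np.random.randn(1000, 256)
x = np.random.randn(256)
print(approx_topk_matvec(W, x, k=8))
```

In practice the sketched matrix W @ S would be precomputed once, so only the small projections are done per query.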
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- A 137.5 TOPS/W SRAM Compute-in-Memory Macro with 9-b Memory Cell-Embedded ADCs and Signal Margin Enhancement Techniques for AI Edge Applications [20.74979295607707]
The CIM macro can perform 4x4-bit MAC operations and yields a 9-bit signed output.
The inherent discharge branches of the cells are utilized to apply time-modulated MAC and 9-bit ADC readout operations.
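As a quick sanity check on the stated widths (our own arithmetic, not from the paper): a signed 4-bit x 4-bit product spans [-56, 64], so a single product fits in 8 signed bits, and a 9-bit signed readout (range [-256, 255]) suggests the accumulated analog value is quantized by the ADC rather than carried at full precision.

```python
# Range check for signed 4-bit x 4-bit multiplication (illustrative):
prods = [a * b for a in range(-8, 8) for b in range(-8, 8)]
print(min(prods), max(prods))   # -56 64 -> one product fits in 8 signed bits
# Accumulating several such products exceeds 9 signed bits, so a 9-bit
# output implies the accumulated value is quantized at ADC readout.
```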
arXiv Detail & Related papers (2023-07-12T06:20:19Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) computes self-attention blockwise and fuses it with the feedforward network to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
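A minimal sketch of the blockwise idea: a streaming log-sum-exp softmax over key/value blocks means the full T x T attention matrix is never materialized. This illustrates blockwise attention generally, not BPT's exact fused kernel.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Memory-efficient attention, one key/value block at a time."""
    T, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(T, -np.inf)               # running max per query row
    l = np.zeros(T)                       # running softmax denominator
    out = np.zeros_like(V, dtype=float)
    for s in range(0, T, block):
        S = (Q @ K[s:s + block].T) * scale       # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)                # rescale old accumulators
        p = np.exp(S - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ V[s:s + block]
        m = m_new
    return out / l[:, None]

T, d = 256, 32
Q, K, V = (np.random.randn(T, d) for _ in range(3))
S = Q @ K.T / np.sqrt(d)                         # reference: full attention
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), ref)
```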
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is a recent multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy with over 80% reduction in computation, but poses challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
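As generic background on why MoE cuts computation (this sketches standard top-k routing, not Edge-MoE's hardware design): only k of the experts run per token, so with k=2 of 16 experts roughly 87% of the expert FLOPs are skipped.

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    """Sparse mixture-of-experts routing (generic illustration)."""
    logits = x @ gate_w                        # (tokens, num_experts)
    top = np.argsort(logits, axis=1)[:, -k:]   # chosen experts per token
    w = np.take_along_axis(logits, top, axis=1)
    w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)   # softmax over top-k
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):                     # evaluate only selected experts
            y[t] += w[t, j] * experts[top[t, j]](x[t])
    return y

d, n_exp = 32, 16
experts = [lambda v, W=np.random.randn(d, d): np.tanh(v @ W) for _ in range(n_exp)]
gate_w = np.random.randn(d, n_exp)
x = np.random.randn(8, d)
print(moe_layer(x, experts, gate_w).shape)     # (8, 32)
```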
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference [4.718504401468233]
PIM solutions rely either on novel memory technologies that have yet to mature, or on bit-serial computations that carry significant performance overhead and scalability issues.
Our work proposes an in-SRAM digital multiplier that uses conventional memory to perform bit-parallel computations, leveraging multiple-wordline activation.
We then introduce DAISM, an architecture leveraging this multiplier, which achieves up to two orders of magnitude higher area efficiency compared to state-of-the-art counterparts, with competitive energy efficiency.
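A software model of the bit-parallel idea, under our own naming (one wordline activation per multiplier bit yields all partial products at once; this is a generic shift-and-add model, not DAISM's circuit):

```python
def bit_parallel_multiply(a, b, bits=8):
    """Model of a bit-parallel multiplier: activating one wordline per
    multiplier bit produces all partial products (a AND b_i) in parallel;
    summing the shifted partial products gives the full product."""
    partial = [(a if (b >> i) & 1 else 0) << i for i in range(bits)]
    return sum(partial)

assert bit_parallel_multiply(13, 11) == 143
```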
arXiv Detail & Related papers (2023-05-12T10:58:21Z)
- A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface [16.228299091691873]
Computing-in-memory (CiM) is a promising mitigation approach that enables multiply-accumulate operations within the memory.
This work achieves 51.2 GOPS throughput and 10.3 TOPS/W energy efficiency, while showing 88.6% accuracy on the CIFAR-10 dataset.
arXiv Detail & Related papers (2022-11-23T07:52:10Z)
- NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators [12.045126404373868]
Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads.
NEON is a novel compiler optimization that enables end-to-end execution of NN workloads in RRAM.
arXiv Detail & Related papers (2022-11-10T17:57:35Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
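A toy sketch of the block-recurrent pattern (our own simplified state update; the paper's cell uses attention between each block and the recurrent state):

```python
import numpy as np

def block_recurrent_pass(tokens, block=16, d=32, rng=np.random.default_rng(0)):
    """Carry a recurrent state across blocks of tokens, while work
    inside each block stays fully parallel (illustrative shapes only)."""
    Wx = rng.standard_normal((d, d)) / np.sqrt(d)
    Ws = rng.standard_normal((d, d)) / np.sqrt(d)
    state = np.zeros(d)                        # recurrent state across blocks
    outputs = []
    for s in range(0, len(tokens), block):
        blk = tokens[s:s + block]              # (block, d): parallel within block
        out = np.tanh(blk @ Wx + state @ Ws)   # every token sees the same state
        state = out.mean(axis=0)               # summarize block into next state
        outputs.append(out)
    return np.concatenate(outputs, axis=0)

tokens = np.random.randn(64, 32)
print(block_recurrent_pass(tokens).shape)      # (64, 32)
```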
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- ATRIA: A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-DRAM CNN Processing [0.5257115841810257]
ATRIA is a novel bit-pArallel sTochastic aRithmetic based In-DRAM Accelerator for high-speed inference of CNNs.
We show that ATRIA exhibits only a 3.5% drop in CNN inference accuracy while still achieving improvements of up to 3.2x in frames-per-second (FPS) and up to 10x in efficiency.
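As background on the stochastic-arithmetic idea ATRIA builds on (a generic sketch, not ATRIA's bit-parallel in-DRAM scheme): encoding values as random bitstreams reduces multiplication to a single AND gate, at the cost of stochastic error.

```python
import numpy as np

def stochastic_multiply(a, b, n_bits=4096, rng=np.random.default_rng(0)):
    """Stochastic-computing multiplication for values in [0, 1]:
    a bitstream's mean encodes the value, and the bitwise AND of two
    independent streams has mean a * b."""
    sa = rng.random(n_bits) < a     # bitstream encoding of a
    sb = rng.random(n_bits) < b     # independent bitstream for b
    return np.mean(sa & sb)         # AND-gate product estimate

print(stochastic_multiply(0.5, 0.8))   # approx 0.4, with stochastic error
```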
arXiv Detail & Related papers (2021-05-26T18:36:01Z)
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, which typically requires external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reductions in storage and training energy, respectively, with negligible accuracy loss compared to state-of-the-art training baselines.
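A hedged illustration of the storage-for-compute trade (a plain truncated-SVD factorization with quantized coefficients, standing in for SmartDeal's actual decomposition): store a small basis plus cheap coefficients instead of the full weight matrix, and rebuild the weights on the fly with extra multiplies.

```python
import numpy as np

def remodel_weights(W, rank=8):
    """Store a small basis plus quantized per-row coefficients instead of
    the full weight matrix (illustration, not SmartDeal's exact method)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    basis = Vt[:rank]                      # small, cheap to store
    coeff = U[:, :rank] * s[:rank]         # per-row coefficients
    coeff_q = np.round(coeff * 16) / 16    # aggressive quantization
    return coeff_q, basis                  # stored form

def forward(x, coeff_q, basis):
    # recompute the effective weights with extra (cheap) computation
    return x @ (coeff_q @ basis).T

W = np.random.randn(64, 64)
coeff_q, basis = remodel_weights(W)
x = np.random.randn(4, 64)
print(forward(x, coeff_q, basis).shape)    # (4, 64)
```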
arXiv Detail & Related papers (2021-01-04T18:54:07Z)