IMAC: In-memory multi-bit Multiplication andACcumulation in 6T SRAM
Array
- URL: http://arxiv.org/abs/2003.12558v1
- Date: Fri, 27 Mar 2020 17:43:19 GMT
- Title: IMAC: In-memory multi-bit Multiplication andACcumulation in 6T SRAM
Array
- Authors: Mustafa Ali, Akhilesh Jaiswal, Sangamesh Kodge, Amogh Agrawal,
Indranil Chakraborty, and Kaushik Roy
- Abstract summary: In-memory computing aims at embedding some aspects of computations inside the memory array.
We present a novel in-memory multiplication followed by accumulation operation capable of performing parallel dot products within 6T array.
The proposed system is 6.24x better in energy consumption and 9.42x better in delay.
- Score: 5.29958909018578
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: `In-memory computing' is being widely explored as a novel computing paradigm
to mitigate the well known memory bottleneck. This emerging paradigm aims at
embedding some aspects of computations inside the memory array, thereby
avoiding frequent and expensive movement of data between the compute unit and
the storage memory. In-memory computing with respect to Silicon memories has
been widely explored on various memory bit-cells. Embedding computation inside
the 6 transistor (6T) SRAM array is of special interest since it is the most
widely used on-chip memory. In this paper, we present a novel in-memory
multiplication followed by accumulation operation capable of performing
parallel dot products within 6T SRAM without any changes to the standard
bitcell. We, further, study the effect of circuit non-idealities and process
variations on the accuracy of the LeNet-5 and VGG neural network architectures
against the MNIST and CIFAR-10 datasets, respectively. The proposed in-memory
dot-product mechanism achieves 88.8% and 99% accuracy for the CIFAR-10 and
MNIST, respectively. Compared to the standard von Neumann system, the proposed
system is 6.24x better in energy consumption and 9.42x better in delay.
Related papers
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matched STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU.
arXiv Detail & Related papers (2024-11-05T05:36:17Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications.
We propose MEMO, a novel framework for fine-grained activation memory management.
We show that MEMO achieves an average of 2.42x and 2.26x MFU compared to Megatron-LM and DeepSpeed.
arXiv Detail & Related papers (2024-07-16T18:59:49Z) - IMBUE: In-Memory Boolean-to-CUrrent Inference ArchitecturE for Tsetlin
Machines [5.6634493664726495]
In-memory computing for Machine Learning (ML) applications remedies the von Neumann bottlenecks by organizing computation to exploit parallelism and locality.
Non-volatile memory devices such as Resistive RAM (ReRAM) offer integrated switching and storage capabilities showing promising performance for ML applications.
This paper proposes an In-Memory Boolean-to-Current Inference Architecture (IMBUE) that uses ReRAM-transistor cells to eliminate the need for such conversions.
arXiv Detail & Related papers (2023-05-22T10:55:01Z) - DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN
Training and Inference [4.718504401468233]
PIM solutions rely either on novel memory technologies that have yet to mature or bit-serial computations that have significant performance overhead and scalability issues.
Our work proposes an in-SRAM digital multiplier, that uses a conventional memory to perform bit-parallel computations, leveraging multiple wordlines activation.
We then introduce DAISM, an architecture leveraging this multiplier, which achieves up to two orders of magnitude higher area efficiency compared to the SOTA counterparts, with competitive energy efficiency.
arXiv Detail & Related papers (2023-05-12T10:58:21Z) - Pex: Memory-efficient Microcontroller Deep Learning through Partial
Execution [11.336229510791481]
We discuss a novel execution paradigm for microcontroller deep learning.
It modifies the execution of neural networks to avoid materialising full buffers in memory.
This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time.
arXiv Detail & Related papers (2022-11-30T18:47:30Z) - MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z) - Kanerva++: extending The Kanerva Machine with differentiable, locally
block allocated latent memory [75.65949969000596]
Episodic and semantic memory are critical components of the human memory model.
We develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory.
We demonstrate that this allocation scheme improves performance in memory conditional image generation.
arXiv Detail & Related papers (2021-02-20T18:40:40Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z) - Leveraging Automated Mixed-Low-Precision Quantization for tiny edge
microcontrollers [76.30674794049293]
This paper presents an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices.
Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors.
Given an MCU-class memory bound to 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine result as accurate as the state-of-the-art solutions.
arXiv Detail & Related papers (2020-08-12T06:09:58Z) - In-memory Implementation of On-chip Trainable and Scalable ANN for AI/ML
Applications [0.0]
This paper presents an in-memory computing architecture for ANN enabling artificial intelligence (AI) and machine learning (ML) applications.
Our novel on-chip training and inference in-memory architecture reduces energy cost and enhances throughput by simultaneously accessing the multiple rows of array per precharge cycle.
The proposed architecture was trained and tested on the IRIS dataset which exhibits $46times$ more energy efficient per MAC (multiply and accumulate) operation compared to earlier classifiers.
arXiv Detail & Related papers (2020-05-19T15:36:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.