Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators
- URL: http://arxiv.org/abs/2308.02024v1
- Date: Thu, 3 Aug 2023 20:36:48 GMT
- Title: Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators
- Authors: Sourjya Roy, Cheng Wang, and Anand Raghunathan
- Abstract summary: Training deep neural networks (DNNs) is an extremely memory-intensive process.
Spin-Transfer-Torque MRAM (STT-MRAM) offers several desirable properties for training accelerators.
We show that MRAM provides up to a 15-22x improvement in system-level energy.
- Score: 9.877596714655096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Progress in artificial intelligence and machine learning over the past decade
has been driven by the ability to train larger deep neural networks (DNNs),
leading to a compute demand that far exceeds the growth in hardware performance
afforded by Moore's law. Training DNNs is an extremely memory-intensive
process, requiring not just the model weights but also activations and
gradients for an entire minibatch to be stored. The need to provide
high-density and low-leakage on-chip memory motivates the exploration of
emerging non-volatile memory for training accelerators. Spin-Transfer-Torque
MRAM (STT-MRAM) offers several desirable properties for training accelerators,
including 3-4x higher density than SRAM, significantly reduced leakage power,
high endurance and reasonable access time. However, MRAM write operations
require high write energy and latency due to the need to ensure reliable
switching.
In this study, we perform a comprehensive device-to-system evaluation and
co-optimization of STT-MRAM for efficient ML training accelerator design. We
devised a cross-layer simulation framework to evaluate the effectiveness of
STT-MRAM as a scratchpad replacing SRAM in a systolic-array-based DNN
accelerator. To address the inefficiency of writes in STT-MRAM, we propose to
reduce write voltage and duration. To evaluate the ensuing accuracy-efficiency
trade-off, we conduct a thorough analysis of the error tolerance of input
activations, weights, and errors during training. We propose heterogeneous
memory configurations that enable training convergence with good accuracy. We
show that MRAM provides up to a 15-22x improvement in system-level energy
across a suite of DNN benchmarks under iso-capacity and iso-area scenarios. Further
optimizing STT-MRAM write operations can provide over 2x improvement in write
energy for minimal degradation in application-level training accuracy.
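As a rough sketch of how such an error-tolerance analysis could be mocked up in software, the snippet below injects independent per-bit flips into a stored float32 tensor, emulating unreliable writes at reduced voltage and duration. The flip model, the numpy helper, and the bit_error_rate value are assumptions made here for illustration; the paper's cross-layer framework derives error rates from device-level models and evaluates them within a systolic-array accelerator simulation.

```python
# Illustrative only: a simple independent per-bit flip model for emulating
# unreliable (under-driven) scratchpad writes. Not the paper's framework.
import numpy as np

def inject_write_errors(tensor: np.ndarray, bit_error_rate: float,
                        rng: np.random.Generator) -> np.ndarray:
    """Return a copy of `tensor` with each stored bit flipped independently
    with probability `bit_error_rate`."""
    flat = np.ascontiguousarray(tensor, dtype=np.float32).reshape(-1)
    bits = flat.view(np.uint32)
    # Build a 32-bit flip mask per element: bit b flips with p = bit_error_rate.
    flips = rng.random((bits.size, 32)) < bit_error_rate
    mask = np.zeros(bits.size, dtype=np.uint32)
    for b in range(32):
        mask |= flips[:, b].astype(np.uint32) << np.uint32(b)
    return (bits ^ mask).view(np.float32).reshape(tensor.shape)

# Example: corrupt a batch of activations as if they were written to and read
# back from an unreliable scratchpad, then count how many values changed.
# The bit error rate is arbitrary here, chosen only to make flips visible.
rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 256)).astype(np.float32)
noisy = inject_write_errors(acts, bit_error_rate=1e-4, rng=rng)
print("fraction of elements changed:", float(np.mean(noisy != acts)))
```

Bit position matters: flips in float32 exponent bits perturb values far more than mantissa flips, which is the kind of sensitivity difference across activations, weights, and errors that such an analysis is meant to expose.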
Related papers
- Random resistive memory-based deep extreme point learning machine for
unified visual processing [67.51600474104171]
We propose a novel hardware-software co-design, the random resistive memory-based deep extreme point learning machine (DEPLM).
Our co-design system achieves huge energy efficiency improvements and training cost reduction when compared to conventional systems.
arXiv: 2023-12-14T09:46:16Z
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv: 2023-06-22T14:07:54Z
- CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning [8.339901980070616]
Training AI on resource-limited devices poses significant challenges due to the demanding computing workload and the substantial memory consumption and data access required by deep neural networks (DNNs).
We propose utilizing embedded dynamic random-access memory (eDRAM) as the primary storage medium for transient training data.
We present a highly efficient on-device training engine named CAMEL, which leverages eDRAM as the primary on-chip memory.
arXiv: 2023-05-04T20:57:01Z
- Efficient Deep Learning Using Non-Volatile Memory Technology [12.866655564742889]
We present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in architectures for deep learning (DL) applications.
In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional caches.
DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in DL applications.
arXiv: 2022-06-27T19:27:57Z
- Braille Letter Reading: A Benchmark for Spatio-Temporal Pattern Recognition on Neuromorphic Hardware [50.380319968947035]
Recent deep learning approaches have reached high accuracy in such tasks, but their implementation on conventional embedded solutions remains computationally and energy expensive.
We propose a new benchmark for computing tactile pattern recognition at the edge through letters reading.
We trained and compared feed-forward and recurrent spiking neural networks (SNNs) offline using back-propagation through time with surrogate gradients, then we deployed them on the Intel Loihi neuromorphic chip for efficient inference.
Our results show that the LSTM outperforms the recurrent SNN in terms of accuracy by 14%. However, the recurrent SNN on Loihi is 237 times more energy efficient.
arXiv: 2022-05-30T14:30:45Z
- Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals [11.31429464715989]
This paper presents a new PIM architecture to efficiently accelerate deep learning tasks.
The design minimizes the required A/D conversions through analog accumulation and neurally approximated peripheral circuits.
Evaluations on different benchmarks demonstrate that Neural-PIM can improve energy efficiency by 5.36x (1.73x) and speed up throughput by 3.43x (1.59x) without losing accuracy.
arXiv: 2022-01-30T16:14:49Z
- MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge [72.16021611888165]
This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices.
The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S).
Our results suggest that unforgettable examples can be identified in-situ even during the dynamic exploration of sparsity masks.
arXiv: 2021-10-26T21:15:17Z
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to the use of external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv: 2021-01-04T18:54:07Z
- DeepNVM++: Cross-Layer Modeling and Optimization Framework of Non-Volatile Memories for Deep Learning [11.228806840123084]
Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional technologies.
In this work we present DeepNVM++, a framework to characterize, model, and analyze NVM-based caches in deep learning (DL) applications.
arXiv: 2020-12-08T16:53:25Z
- One-step regression and classification with crosspoint resistive memory arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge.
One-step learning is supported by simulations of the prediction of the cost of a house in Boston and the training of a 2-layer neural network for MNIST digit recognition.
Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv: 2020-05-05T08:00:07Z
- A New MRAM-based Process In-Memory Accelerator for Efficient Neural Network Training with Floating Point Precision [28.458719513745812]
We propose a spin orbit torque magnetic random access memory (SOT-MRAM) based digital PIM accelerator that supports floating point precision.
Experimental results show that the proposed SOT-MRAM PIM-based DNN training accelerator can achieve 3.3x, 1.8x, and 2.5x improvement in energy, latency, and area, respectively.
arXiv: 2020-03-02T04:58:54Z