Efficient Deep Learning Using Non-Volatile Memory Technology
- URL: http://arxiv.org/abs/2206.13601v1
- Date: Mon, 27 Jun 2022 19:27:57 GMT
- Title: Efficient Deep Learning Using Non-Volatile Memory Technology
- Authors: Ahmet Inci, Mehmet Meric Isgenc, Diana Marculescu
- Abstract summary: We present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications.
In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction, respectively, compared to conventional SRAM.
DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
- Score: 12.866655564742889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embedded machine learning (ML) systems have now become the dominant platform
for deploying ML serving tasks and are projected to become of equal importance
for training ML models. With this comes the challenge of overall efficient
deployment, in particular low-power and high-throughput implementations, under
stringent memory constraints. In this context, non-volatile memory (NVM)
technologies such as STT-MRAM and SOT-MRAM have significant advantages compared
to conventional SRAM due to their non-volatility, higher cell density, and
scalability features. While prior work has investigated several architectural
implications of NVM for generic applications, in this work we present
DeepNVM++, a comprehensive framework to characterize, model, and analyze
NVM-based caches in GPU architectures for deep learning (DL) applications by
combining technology-specific circuit-level models and the actual memory
behavior of various DL workloads. DeepNVM++ relies on iso-capacity and iso-area
performance and energy models for last-level caches implemented using
conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the
iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x
energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared
to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and
SOT-MRAM provide up to 2.2x and 2.4x EDP reduction and accommodate 2.3x and
3.3x cache capacity when compared to SRAM, respectively. We also perform a
scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of
magnitude EDP reduction when compared to SRAM for large cache capacities.
DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies and can be used for the
characterization, modeling, and analysis of any NVM technology for last-level
caches in GPUs for DL applications.
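
The abstract's headline numbers are energy-delay product (EDP) comparisons under iso-capacity and iso-area assumptions. As a rough illustration of what such a comparison computes, here is a minimal Python sketch; every per-access figure in it is a placeholder, not a value from the paper, and the real framework derives these inputs from technology-specific circuit-level models combined with measured DL workload behavior.

```python
# A minimal sketch of an iso-capacity EDP comparison in the spirit of
# DeepNVM++. All per-access numbers below are invented placeholders,
# NOT values from the paper.

def edp(energy_nj_per_access: float, latency_ns: float, accesses: int) -> float:
    """Energy-delay product (total energy x total delay) for an access stream."""
    return (energy_nj_per_access * accesses) * (latency_ns * accesses)

ACCESSES = 1_000_000  # hypothetical number of last-level-cache accesses

# Placeholder (energy in nJ, latency in ns) per access for iso-capacity caches.
techs = {
    "SRAM":     (0.50, 2.0),   # assumed baseline
    "STT-MRAM": (0.35, 1.4),   # assumed
    "SOT-MRAM": (0.30, 1.2),   # assumed
}

baseline = edp(*techs["SRAM"], ACCESSES)
for name, (e, t) in techs.items():
    reduction = baseline / edp(e, t, ACCESSES)
    print(f"{name}: {reduction:.2f}x EDP reduction vs. SRAM")
```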
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization (a generic pipelining sketch appears after this list).
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to the maximum (4x) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling, and pipelining (a dequantize-then-multiply sketch appears after this list).
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z)
- DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory [6.367916611208411]
We propose DDC-PIM, an efficient algorithm/architecture co-design methodology that effectively doubles the equivalent data capacity.
DDC-PIM yields about a 2.84x speedup on MobileNetV2 and a 2.69x speedup on EfficientNet-B0 with negligible accuracy loss.
Compared with state-of-the-art macros, DDC-PIM achieves up to 8.41x and 2.75x improvement in weight density and area efficiency, respectively.
arXiv Detail & Related papers (2023-10-31T12:49:54Z)
- Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators [9.877596714655096]
Training deep neural networks (DNNs) is an extremely memory-intensive process.
Spin-Transfer-Torque MRAM (STT-MRAM) offers several desirable properties for training accelerators.
We show that MRAM provides up to a 15-22x improvement in system-level energy.
arXiv Detail & Related papers (2023-08-03T20:36:48Z)
- TL-nvSRAM-CIM: Ultra-High-Density Three-Level ReRAM-Assisted Computing-in-nvSRAM with DC-Power Free Restore and Ternary MAC Operations [8.669532093397065]
This work proposes an ultra-high-density three-level ReRAM-assisted computing scheme for large NN models.
The proposed TL-nvSRAM-CIM achieves 7.8x higher storage density compared with state-of-the-art works.
arXiv Detail & Related papers (2023-07-06T01:46:06Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- Memory-Oriented Design-Space Exploration of Edge-AI Hardware for XR Applications [5.529817156718514]
Low-power edge-AI capabilities are essential for on-device extended reality (XR) applications to support the vision of the Metaverse.
In this work, we investigate two representative XR workloads: (i) Hand detection and (ii) Eye segmentation, for hardware design space exploration.
For both applications, we train deep neural networks and analyze the impact of quantization and hardware specific bottlenecks.
We also evaluate the impact of integrating state-of-the-art emerging non-volatile memory technologies (STT/SOT/VGSOT MRAM) into the XR-AI inference pipeline.
arXiv Detail & Related papers (2022-06-08T11:18:02Z)
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, which typically forces weight storage into external dynamic random-access memory (DRAM).
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reductions in storage and training energy, respectively, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z)
- DeepNVM++: Cross-Layer Modeling and Optimization Framework of Non-Volatile Memories for Deep Learning [11.228806840123084]
Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional technologies.
In this work we present DeepNVM++, a framework to characterize, model, and analyze NVM-based caches in deep learning (DL) applications.
arXiv Detail & Related papers (2020-12-08T16:53:25Z)
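
As noted in the MoE-Lightning entry above, that system's throughput comes from a CPU-GPU-I/O pipelining schedule (CGOPipe) with paged weights. The following Python sketch illustrates the general pipelining idea only, not the paper's implementation: three stages are overlapped through bounded queues so no stage idles while another works, with the queue bound loosely playing the role of a fixed budget of paged weight buffers. All names and per-stage costs are invented.

```python
import threading, queue, time

def stage(name: str, inbox: queue.Queue, outbox, cost_s: float):
    """Pull a micro-batch, 'work' on it, pass it downstream; None shuts down."""
    while (item := inbox.get()) is not None:
        time.sleep(cost_s)                       # stand-in for real I/O/CPU/GPU work
        print(f"{name} finished micro-batch {item}")
        if outbox is not None:
            outbox.put(item)
    if outbox is not None:
        outbox.put(None)                         # propagate shutdown downstream

# Bounded queues provide back-pressure between stages.
io_q, cpu_q, gpu_q = (queue.Queue(maxsize=2) for _ in range(3))
stages = [("I/O", io_q, cpu_q, 0.02),
          ("CPU", cpu_q, gpu_q, 0.01),
          ("GPU", gpu_q, None, 0.03)]
threads = [threading.Thread(target=stage, args=s) for s in stages]
for t in threads:
    t.start()
for batch in range(8):
    io_q.put(batch)                              # feed eight micro-batches
io_q.put(None)                                   # end-of-stream marker
for t in threads:
    t.join()
```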
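
And here is the dequantize-then-multiply sketch referenced in the MARLIN entry. It shows, in plain numpy, what a mixed-precision linear layer computes: int4 weights are dequantized per group and multiplied against fp16 activations. The group size, shapes, and function names are assumptions for illustration; the actual MARLIN kernel fuses these steps on the GPU with asynchronous memory access and pipelining rather than materializing the fp16 weight matrix.

```python
import numpy as np

GROUP = 128  # quantization group size (an assumption, not MARLIN's fixed choice)

def quantize_int4(w: np.ndarray):
    """Per-group symmetric 4-bit quantization of a weight matrix."""
    g = w.reshape(-1, GROUP).astype(np.float32)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def mixed_precision_linear(x: np.ndarray, q: np.ndarray,
                           scale: np.ndarray, out_features: int) -> np.ndarray:
    """fp16 activations times int4 weights: dequantize per group, then matmul."""
    w = (q.astype(np.float16) * scale).reshape(out_features, -1)
    return x @ w.T  # a fused kernel would avoid materializing w

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float16)  # [out, in] weights
x = rng.standard_normal((16, 256)).astype(np.float16)   # batch of 16 tokens
q, s = quantize_int4(w)
y = mixed_precision_linear(x, q, s, out_features=256)
print(y.shape, float(np.abs(y - x @ w.T).mean()))  # shape and quantization error
```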