CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device
Learning
- URL: http://arxiv.org/abs/2305.03148v3
- Date: Fri, 22 Dec 2023 19:22:05 GMT
- Title: CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device
Learning
- Authors: Sai Qian Zhang, Thierry Tambe, Nestor Cuevas, Gu-Yeon Wei, David
Brooks
- Abstract summary: Training AI on resource-limited devices poses significant challenges due to the demanding computing workload and the substantial memory consumption and data access required by deep neural networks (DNNs).
We propose utilizing embedded dynamic random-access memory (eDRAM) as the primary storage medium for transient training data.
We present a highly efficient on-device training engine named CAMEL, which leverages eDRAM as the primary on-chip memory.
- Score: 8.339901980070616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-device learning allows AI models to adapt to user data, thereby enhancing
service quality on edge platforms. However, training AI on resource-limited
devices poses significant challenges due to the demanding computing workload
and the substantial memory consumption and data access required by deep neural
networks (DNNs). To address these issues, we propose utilizing embedded dynamic
random-access memory (eDRAM) as the primary storage medium for transient
training data. In comparison to static random-access memory (SRAM), eDRAM
provides higher storage density and lower leakage power, resulting in reduced
access cost and power leakage. Nevertheless, to maintain the integrity of the
stored data, periodic power-hungry refresh operations could potentially degrade
system performance.
To minimize the occurrence of expensive eDRAM refresh operations, it is
beneficial to shorten the lifetime of stored data during the training process.
To achieve this, we adopt the principles of algorithm and hardware co-design,
introducing a family of reversible DNN architectures that effectively decrease
data lifetime and storage costs throughout training. Additionally, we present a
highly efficient on-device training engine named \textit{CAMEL}, which
leverages eDRAM as the primary on-chip memory. This engine enables efficient
on-device training with significantly reduced memory usage and off-chip DRAM
traffic while maintaining superior training accuracy. We evaluate our CAMEL
system on multiple DNNs with different datasets, demonstrating a $2.5\times$
speedup of the training process and $2.8\times$ training energy savings
compared to other baseline hardware platforms.
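
To make the data-lifetime idea concrete, below is a minimal NumPy sketch of a RevNet-style reversible coupling, the general principle behind reversible architectures: a block's inputs can be recomputed exactly from its outputs, so intermediate activations need not stay resident in on-chip memory and rarely live long enough to require an eDRAM refresh. This is an illustration under assumed shapes and placeholder sub-functions F and G, not CAMEL's actual layer design.

```python
import numpy as np

# Hedged sketch (not the paper's actual layer design): a RevNet-style
# reversible coupling. Because inputs are recoverable from outputs, the
# forward activations do not need to be stored until the backward pass,
# which shortens the lifetime of transient training data.

def F(x, w):
    """Placeholder sub-network: a linear layer followed by ReLU."""
    return np.maximum(x @ w, 0.0)

def G(x, w):
    """Second placeholder sub-network, same illustrative form."""
    return np.maximum(x @ w, 0.0)

def reversible_forward(x1, x2, wf, wg):
    y1 = x1 + F(x2, wf)
    y2 = x2 + G(y1, wg)
    return y1, y2

def reversible_inverse(y1, y2, wf, wg):
    # Recover the inputs from the outputs instead of storing them.
    x2 = y2 - G(y1, wg)
    x1 = y1 - F(x2, wf)
    return x1, x2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x1, x2 = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
    wf, wg = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
    y1, y2 = reversible_forward(x1, x2, wf, wg)
    r1, r2 = reversible_inverse(y1, y2, wf, wg)
    assert np.allclose(x1, r1) and np.allclose(x2, r2)
```

The coupling trades a small amount of recomputation in the backward pass for a large reduction in stored-activation lifetime, which is the kind of trade-off the abstract describes for avoiding costly eDRAM refresh operations.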
Related papers
- Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in partially observable, long-horizon reinforcement learning environments.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
arXiv Detail & Related papers (2024-10-14T03:50:17Z)
- Efficient and accurate neural field reconstruction using resistive memory [52.68088466453264]
Traditional signal reconstruction methods on digital computers face both software and hardware challenges.
We propose a systematic approach with software-hardware co-optimizations for signal reconstruction from sparse inputs.
This work advances the AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
arXiv Detail & Related papers (2024-04-15T09:33:09Z)
- Random resistive memory-based deep extreme point learning machine for unified visual processing [67.51600474104171]
We propose a novel hardware-software co-design: a random resistive memory-based deep extreme point learning machine (DEPLM).
Our co-design system achieves significant energy efficiency improvements and training cost reductions compared to conventional systems.
arXiv Detail & Related papers (2023-12-14T09:46:16Z)
- Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
- Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators [9.877596714655096]
Training deep neural networks (DNNs) is an extremely memory-intensive process.
Spin-Transfer-Torque MRAM (STT-MRAM) offers several desirable properties for training accelerators.
We show that MRAM provides up to a 15-22x improvement in system-level energy.
arXiv Detail & Related papers (2023-08-03T20:36:48Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures [0.7792020418343023]
Graph neural networks (GNNs) can extract features by learning both the representation of each object (i.e., graph node) and the relationships across different objects.
Despite their strengths, deploying these algorithms in a production environment faces several challenges, as the number of graph nodes and edges grows to the scale of several billion to hundreds of billions.
In this work, we first conduct a detailed characterization of a state-of-the-art, large-scale GNN training algorithm, GraphSAGE.
Based on the characterization, we then explore the feasibility of utilizing capacity-optimized NVM for storing...
arXiv Detail & Related papers (2022-05-10T07:25:30Z)
- GradPIM: A Practical Processing-in-DRAM Architecture for Gradient Descent [17.798991516056454]
We present GradPIM, a processing-in-memory architecture that accelerates the parameter updates of deep neural network training.
Extending DDR4 SDRAM to utilize bank-group parallelism makes our operation designs in the processing-in-memory (PIM) module efficient in terms of hardware cost and performance.
arXiv Detail & Related papers (2021-02-15T12:25:26Z)
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, requiring external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reductions in storage and training energy, respectively, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z)
- $DA^3$: Deep Additive Attention Adaption for Memory-Efficient On-Device Multi-Domain Learning [30.53018068935323]
The large memory required for activation storage is the main bottleneck limiting training time and cost on edge devices.
We propose Deep Additive Attention Adaption ($DA^3$), a novel memory-efficient on-device multi-domain learning method.
We validate $DA^3$ on multiple datasets against state-of-the-art methods, showing improvements in both accuracy and training time.
arXiv Detail & Related papers (2020-12-02T18:03:18Z)