Dynamic Gradient Sparse Update for Edge Training
- URL: http://arxiv.org/abs/2503.17959v1
- Date: Sun, 23 Mar 2025 06:32:12 GMT
- Title: Dynamic Gradient Sparse Update for Edge Training
- Authors: I-Hsuan Li, Tian-Sheuan Chang
- Abstract summary: Gradient computation for backpropagation during training requires significant memory buffers to store intermediate features and compute losses. This is unacceptable for memory-constrained edge devices such as microcontrollers. We propose a training acceleration method using dynamic gradient sparse updates to reduce memory usage.
- Score: 0.0502254944841629
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training on edge devices enables personalized model fine-tuning to enhance real-world performance and maintain data privacy. However, the gradient computation for backpropagation during training requires significant memory buffers to store intermediate features and compute losses. This is unacceptable for memory-constrained edge devices such as microcontrollers. To tackle this issue, we propose a training acceleration method using dynamic gradient sparse updates. This method updates only the important channels and layers and skips gradient computation for the less important ones, reducing memory usage in each update iteration. In addition, the channel selection changes dynamically across iterations so that most of the parameters in the update layers are traversed over time for better performance. Experimental results show that the proposed method enables an ImageNet pre-trained MobileNetV2 trained on CIFAR-10 to achieve an accuracy of 85.77% while updating only 2% of the convolution weights within 256KB of on-chip memory. This results in a remarkable 98% reduction in feature memory usage compared to dense model training.
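The per-iteration procedure described in the abstract can be summarized in a short sketch. The PyTorch-style code below is a minimal illustration under stated assumptions, not the authors' implementation: the importance score (per-channel L1 weight norm), the rotation scheme (rolling the ranked channel list by the step index), the 2% channel budget, and all function names are hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn

def select_channels(conv: nn.Conv2d, frac: float, offset: int) -> torch.Tensor:
    """Pick a rotating subset of output channels to update in this iteration.

    Channels are ranked by an assumed importance proxy (L1 weight magnitude),
    and the selection window is shifted by `offset` each call so that most
    channels are visited over many iterations.
    """
    n_out = conv.out_channels
    k = max(1, int(frac * n_out))
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # per-output-channel L1 norm
    ranked = torch.argsort(scores, descending=True)
    rolled = torch.roll(ranked, shifts=offset % n_out)      # rotate the ranking over time
    return rolled[:k]

def sparse_update_step(model: nn.Module, loss: torch.Tensor,
                       optimizer: torch.optim.Optimizer,
                       frac: float = 0.02, step: int = 0) -> None:
    """One training step that updates only the selected channels of each conv."""
    optimizer.zero_grad()
    loss.backward()
    for m in model.modules():
        if isinstance(m, nn.Conv2d) and m.weight.grad is not None:
            keep = select_channels(m, frac, offset=step)
            mask = torch.zeros(m.out_channels, dtype=torch.bool,
                               device=m.weight.device)
            mask[keep] = True
            # Zero the gradients of non-selected channels. A real edge
            # implementation would skip computing them (and the feature
            # buffers they need) in the first place; masking here only
            # emulates the update pattern, not the memory saving.
            m.weight.grad[~mask] = 0.0
            if m.bias is not None and m.bias.grad is not None:
                m.bias.grad[~mask] = 0.0
    optimizer.step()
```

Note that masking gradients after a full backward pass only reproduces the sparse update pattern; the 98% feature-memory reduction reported in the paper comes from skipping the backward computation and the corresponding intermediate-feature buffers for non-selected channels and layers entirely.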
Related papers
- Stepping Forward on the Last Mile [8.756033984943178]
We propose a series of algorithm enhancements that further reduce the memory footprint and the accuracy gap compared to backpropagation.
Our results demonstrate that on the last mile of model customization on edge devices, training with fixed-point forward gradients is a feasible and practical approach.
arXiv Detail & Related papers (2024-11-06T16:33:21Z)
- Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR) in which we fine-tune only a fraction of total blocks of a pre-trained model.
Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x.
arXiv Detail & Related papers (2024-03-25T08:41:01Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge [27.533985670823945]
TinyTrain is an on-device training approach that drastically reduces training time by selectively updating parts of the model.
TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy.
It achieves 9.5x faster and 3.5x more energy-efficient training over status-quo approaches.
arXiv Detail & Related papers (2023-07-19T13:49:12Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training (a minimal sketch of this idea appears after this list).
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall and significantly reduces the resulting energy consumption and CO2 emissions.
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Software-Level Accuracy Using Stochastic Computing With Charge-Trap-Flash Based Weight Matrix [2.580765958706854]
Charge Trap Flash (CTF) memory was shown to have a large number of levels before saturation, but variable non-linearity.
We show, through simulations, that at an optimum choice of the range, our system performs nearly as well as the models trained using exact floating point operations.
We also show its use in reinforcement learning for value function approximation in Q-Learning, where it learns to complete an episode of the mountain car control problem in around 146 steps.
arXiv Detail & Related papers (2020-03-09T02:45:58Z)
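As flagged in the Mesa entry above, the low-precision activation-storage idea can also be sketched compactly. The code below is a minimal illustration assuming a plain 8-bit min-max quantizer wrapped around a ReLU; the class name, the quantization scheme, and the choice of what to cache are assumptions for illustration, not Mesa's actual implementation.

```python
import torch

class MemorySavingReLU(torch.autograd.Function):
    """ReLU whose cached activation is stored in int8 instead of float32.

    The forward pass returns the exact output; only the tensor saved for the
    backward pass is quantized (assumed 8-bit min-max scheme), which is the
    general idea behind low-precision activation storage.
    """

    @staticmethod
    def forward(ctx, x):
        out = torch.relu(x)
        # Quantize the tensor needed for backward (here: the output, since
        # relu'(x) only depends on whether the output is > 0).
        scale = out.abs().amax().clamp(min=1e-8) / 127.0
        q = torch.clamp((out / scale).round(), -128, 127).to(torch.int8)
        ctx.save_for_backward(q, scale)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, scale = ctx.saved_tensors
        out = q.to(grad_out.dtype) * scale  # dequantize the cached activation
        return grad_out * (out > 0).to(grad_out.dtype)

# Usage: y = MemorySavingReLU.apply(x)
```

The forward result stays exact; only the tensor cached for the backward pass is compressed, trading a small quantization error in the gradient for a roughly 4x smaller activation footprint for this layer. A full framework would apply the same pattern to the memory-dominant layers; which layers and which quantizer to use are left open here.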
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.