Related papers: Poor Man's Training on MCUs: A Memory-Efficient Quantized Back-Propagation-Free Approach

Poor Man's Training on MCUs: A Memory-Efficient Quantized Back-Propagation-Free Approach

URL: http://arxiv.org/abs/2411.05873v1
Date: Thu, 07 Nov 2024 22:42:57 GMT
Title: Poor Man's Training on MCUs: A Memory-Efficient Quantized Back-Propagation-Free Approach
Authors: Yequan Zhao, Hai Li, Ian Young, Zheng Zhang,
Abstract summary: Back propagation (BP) is the default solution for gradient computation in neural network training. implementing BP-based training on various edge devices such as FPGA, microcontrollers (MCUs) and analog computing platforms face multiple challenges. This paper presents a simple BP-free training scheme on an MCU, which makes edge training hardware design as easy as inference hardware design.
Score: 9.199493064055586
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Back propagation (BP) is the default solution for gradient computation in neural network training. However, implementing BP-based training on various edge devices such as FPGA, microcontrollers (MCUs), and analog computing platforms face multiple major challenges, such as the lack of hardware resources, long time-to-market, and dramatic errors in a low-precision setting. This paper presents a simple BP-free training scheme on an MCU, which makes edge training hardware design as easy as inference hardware design. We adopt a quantized zeroth-order method to estimate the gradients of quantized model parameters, which can overcome the error of a straight-through estimator in a low-precision BP scheme. We further employ a few dimension reduction methods (e.g., node perturbation, sparse training) to improve the convergence of zeroth-order training. Experiment results show that our BP-free training achieves comparable performance as BP-based training on adapting a pre-trained image classifier to various corrupted data on resource-constrained edge devices (e.g., an MCU with 1024-KB SRAM for dense full-model training, or an MCU with 256-KB SRAM for sparse training). This method is most suitable for application scenarios where memory cost and time-to-market are the major concerns, but longer latency can be tolerated.

Related papers

Stepping Forward on the Last Mile [8.756033984943178]
We propose a series of algorithm enhancements that further reduce the memory footprint, and the accuracy gap compared to backpropagation. Our results demonstrate that on the last mile of model customization on edge devices, training with fixed-point forward gradients is a feasible and practical approach.
arXiv Detail & Related papers (2024-11-06T16:33:21Z)
When Foresight Pruning Meets Zeroth-Order Optimization: Efficient Federated Learning for Low-Memory Devices [36.23767349592602]
Federated Learning (FL) enables collaborative learning in Artificial Intelligence of Things (AIoT) design. FL fails to work on low-memory AIoT devices due to its heavy memory usage. We propose a federated foresight pruning method based on Neural Tangent Kernel (NTK), which can seamlessly integrate with federated BP-Free training frameworks.
arXiv Detail & Related papers (2024-05-08T02:24:09Z)
A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs) MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
Tensor-Compressed Back-Propagation-Free Training for (Physics-Informed) Neural Networks [15.188785164091987]
Backward propagation (BP) is widely used to compute the gradients in neural network training. It is hard to implement BP on edge devices due to the lack of hardware and software resources to support automatic differentiation. This paper presents a completely BP-free framework that only requires forward propagation to train realistic neural networks.
arXiv Detail & Related papers (2023-08-18T23:56:50Z)
Aggregating Capacity in FL through Successive Layer Training for Computationally-Constrained Devices [3.4530027457862]
Federated learning (FL) is usually performed on resource-constrained edge devices. FL training process should be adjusted to such constraints. We propose a new method that enables successive freezing and training of the parameters of the FL model at devices.
arXiv Detail & Related papers (2023-05-26T15:04:06Z)
On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
BackLink: Supervised Local Training with Backward Links [2.104758015212034]
This work proposes a novel local training algorithm, BackLink, which introduces inter- module backward dependency and allows errors to flow between modules. Our method can lead up to a 79% reduction in memory cost and 52% in simulation runtime in ResNet110 compared to the standard BP.
arXiv Detail & Related papers (2022-05-14T21:49:47Z)
Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution. Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x. We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization(PTQ) of PLMs, and propose module-wise quantization error minimization(MREM), an efficient solution to mitigate these issues. Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
GDP: Stabilized Neural Network Pruning via Gates with Differentiable Polarization [84.57695474130273]
Gate-based or importance-based pruning methods aim to remove channels whose importance is smallest. GDP can be plugged before convolutional layers without bells and whistles, to control the on-and-off of each channel. Experiments conducted over CIFAR-10 and ImageNet datasets show that the proposed GDP achieves the state-of-the-art performance.
arXiv Detail & Related papers (2021-09-06T03:17:10Z)
Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function. We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model. We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.