TinyTL: Reduce Activations, Not Trainable Parameters for Efficient
On-Device Learning
- URL: http://arxiv.org/abs/2007.11622v5
- Date: Sun, 6 Jun 2021 01:23:16 GMT
- Title: TinyTL: Reduce Activations, Not Trainable Parameters for Efficient On-Device Learning
- Authors: Han Cai, Chuang Gan, Ligeng Zhu, Song Han
- Abstract summary: On-device learning enables edge devices to continually adapt AI models to new data.
Existing work solves this problem by reducing the number of trainable parameters.
We present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning.
- Score: 78.80707950262214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: On-device learning enables edge devices to continually adapt AI
models to new data, which requires a small memory footprint to fit the tight
memory constraints of edge devices. Existing work addresses this problem by
reducing the number of trainable parameters. However, this does not directly
translate into memory savings, since the major bottleneck is the activations,
not the parameters. In this work, we present Tiny-Transfer-Learning (TinyTL)
for memory-efficient on-device learning. TinyTL freezes the weights and learns
only the bias modules, so the intermediate activations do not need to be
stored. To maintain the adaptation capacity, we introduce a new
memory-efficient bias module, the lite residual module, which refines the
feature extractor by learning small residual feature maps while adding only
3.8% memory overhead. Extensive experiments show that TinyTL significantly
reduces memory usage (up to 6.5x) with little accuracy loss compared to
fine-tuning the full network. Compared to fine-tuning only the last layer,
TinyTL provides significant accuracy improvements (up to 34.1%) with little
memory overhead. Furthermore, combined with feature extractor adaptation,
TinyTL provides 7.3-12.9x memory savings without sacrificing accuracy compared
to fine-tuning the full Inception-V3.
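To make the frozen-weight, bias-only idea concrete, below is a minimal PyTorch-style sketch. It is illustrative only, not the authors' released implementation: the names `LiteResidualBlock` and `freeze_weights_train_biases`, the exact branch layout, and the usage with a torchvision backbone are assumptions made for the sake of the example.

```python
import torch
import torch.nn as nn


class LiteResidualBlock(nn.Module):
    """Wraps a weight-frozen block with a small trainable residual branch.

    A rough stand-in for the paper's lite residual module (assumed layout):
    downsample -> cheap group conv -> upsample, added to the frozen output.
    """

    def __init__(self, frozen_block: nn.Module, channels: int,
                 reduction: int = 2, groups: int = 2):
        super().__init__()
        self.frozen_block = frozen_block        # its weights are frozen elsewhere
        # channels is assumed divisible by `groups`
        self.lite_branch = nn.Sequential(       # trainable, runs at low resolution
            nn.AvgPool2d(reduction),
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=1, groups=groups, bias=True),
            nn.Upsample(scale_factor=reduction, mode="nearest"),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumes the frozen block preserves spatial size and channel count
        return self.frozen_block(x) + self.lite_branch(x)


def freeze_weights_train_biases(backbone: nn.Module):
    """Freeze every weight; keep only bias parameters trainable.

    The gradient of a bias does not depend on the layer's input activation,
    which is why bias-only updates can skip storing most activations
    (whether the framework actually frees them is an implementation detail).
    """
    trainable = []
    for name, param in backbone.named_parameters():
        param.requires_grad = name.endswith("bias")
        if param.requires_grad:
            trainable.append(param)
    return trainable


# Hypothetical usage with a torchvision backbone:
# backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
# bias_params = freeze_weights_train_biases(backbone)
# backbone[3] = LiteResidualBlock(backbone[3], channels=24)  # wrap one frozen stage
# trainable = bias_params + list(backbone[3].lite_branch.parameters())
# optimizer = torch.optim.SGD(trainable, lr=0.05, momentum=0.9)
```

In this sketch the optimizer only ever sees bias tensors and the lite branch's parameters, which is the property that keeps the training-time activation footprint small.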
Related papers
- Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation [29.139579820699495]
This work strives to reduce the memory overhead of fine-tuning from the perspectives of the activation function and layer normalization.
We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives to the GELU and SiLU activation functions.
In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers.
arXiv Detail & Related papers (2024-06-24T03:09:15Z)
- DTL: Disentangled Transfer Learning for Visual Recognition [21.549234013998255]
We introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN).
The proposed method not only reduces GPU memory usage and the number of trainable parameters by a large amount, but also outperforms existing PETL methods in accuracy by a significant margin.
arXiv Detail & Related papers (2023-12-13T02:51:26Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- MobileTL: On-device Transfer Learning with Inverted Residual Blocks [14.305834934988185]
We present MobileTL, a transfer learning method for models built with Inverted Residual Blocks (IRBs).
MobileTL trains the shifts for internal normalization layers to avoid storing activation maps for the backward pass.
Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively.
arXiv Detail & Related papers (2022-12-05T23:07:55Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by more substantial amounts.
arXiv Detail & Related papers (2022-06-13T23:51:56Z)
- A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
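A recurring point across TinyTL and the related papers above is that training memory is dominated by activations rather than by trainable parameters. The sketch below is a rough way to see that gap on a concrete model; it assumes PyTorch and torchvision are installed, the helper names (`parameter_bytes`, `approx_activation_bytes`) are made up for illustration, and the hook-based count is only a proxy for what an autograd engine actually retains.

```python
import torch
import torch.nn as nn
import torchvision.models as models  # assumed available alongside PyTorch


def parameter_bytes(model: nn.Module) -> int:
    """Total bytes held by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


def approx_activation_bytes(model: nn.Module, x: torch.Tensor) -> int:
    """Sum the output sizes of all leaf modules for one forward pass.

    This is only a proxy for what backpropagation would keep alive, but it
    shows the order of magnitude of the activation footprint.
    """
    total = 0
    handles = []

    def hook(_module, _inputs, output):
        nonlocal total
        if isinstance(output, torch.Tensor):
            total += output.numel() * output.element_size()

    for m in model.modules():
        if len(list(m.children())) == 0:  # register on leaf modules only
            handles.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return total


if __name__ == "__main__":
    net = models.mobilenet_v2()
    batch = torch.randn(8, 3, 224, 224)   # batch of 8 RGB images
    print(f"parameters : {parameter_bytes(net) / 2**20:6.1f} MiB")
    print(f"activations: {approx_activation_bytes(net, batch) / 2**20:6.1f} MiB")
```

Running something like this on a typical mobile backbone shows activations outweighing parameter storage by a large factor at realistic batch sizes, which is the gap the methods listed above target from different angles.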
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.