MobileTL: On-device Transfer Learning with Inverted Residual Blocks
- URL: http://arxiv.org/abs/2212.03246v2
- Date: Sat, 8 Apr 2023 16:47:30 GMT
- Title: MobileTL: On-device Transfer Learning with Inverted Residual Blocks
- Authors: Hung-Yueh Chiang, Natalia Frumkin, Feng Liang, Diana Marculescu
- Abstract summary: We present MobileTL, a transfer learning method for models built with Inverted Residual Blocks (IRBs).
MobileTL trains the shifts for internal normalization layers to avoid storing activation maps for the backward pass.
Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively.
- Score: 14.305834934988185
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transfer learning on edge is challenging due to on-device limited resources.
Existing work addresses this issue by training a subset of parameters or adding
model patches. Developed with inference in mind, Inverted Residual Blocks
(IRBs) split a convolutional layer into depthwise and pointwise convolutions,
leading to more stacked layers, e.g., convolution, normalization, and
activation layers. Though they are efficient for inference, IRBs require
additional activation maps to be stored in memory when training the weights of
convolution layers and the scales of normalization layers. As a result, their
high memory cost prohibits training IRBs on resource-limited edge devices,
making them unsuitable in the context of transfer learning. To address this
issue, we present MobileTL, a memory- and compute-efficient on-device
transfer learning method for models built with IRBs. MobileTL trains the shifts
for internal normalization layers to avoid storing activation maps for the
backward pass. Also, MobileTL approximates the backward computation of the
activation layer (e.g., Hard-Swish and ReLU6) as a signed function, which
enables storing a binary mask instead of the full activation maps for the backward pass.
MobileTL fine-tunes a few top blocks (close to the output) rather than propagating
the gradient through the whole network to reduce the computation cost. Our
method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs,
respectively. For MobileNetV3, we observe a 36% reduction in floating-point
operations (FLOPs) when fine-tuning 5 blocks, while only incurring a 0.6%
accuracy reduction on CIFAR10. Extensive experiments on multiple datasets
demonstrate that our method is Pareto-optimal (best accuracy under given
hardware constraints) compared to prior work in transfer learning for edge
devices.
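
To make the activation trick above concrete, here is a minimal PyTorch sketch (not the authors' implementation) of activation layers whose backward pass is served from a stored 1-bit mask rather than the full activation map; the class names and the particular mask used for Hard-Swish are illustrative assumptions.

```python
# Minimal sketch of backward-pass approximation via a binary mask.
# Names and the Hard-Swish mask choice are illustrative, not the paper's code.
import torch


class ReLU6BinaryMask(torch.autograd.Function):
    """ReLU6 whose backward pass stores only a 1-bit mask.

    The ReLU6 gradient is exactly 1 where 0 < x < 6 and 0 elsewhere, so a
    boolean mask suffices for the backward pass and the full-precision input
    never has to be kept in memory.
    """

    @staticmethod
    def forward(ctx, x):
        mask = (x > 0) & (x < 6)          # 1 bit per element instead of the activation map
        ctx.save_for_backward(mask)
        return torch.clamp(x, 0.0, 6.0)

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask


class HardSwishStepBackward(torch.autograd.Function):
    """Hard-Swish with its gradient approximated by a sign-like step function.

    The exact Hard-Swish gradient depends on the input value; replacing it with
    the indicator (x > 0) means only a binary mask has to be stored.
    """

    @staticmethod
    def forward(ctx, x):
        mask = x > 0
        ctx.save_for_backward(mask)
        return x * torch.clamp(x + 3.0, 0.0, 6.0) / 6.0

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask            # approximate gradient: pass-through where x > 0


if __name__ == "__main__":
    x = torch.randn(2, 16, 8, 8, requires_grad=True)
    ReLU6BinaryMask.apply(x).sum().backward()   # gradient comes from the 1-bit mask only
```

Note that PyTorch stores boolean tensors at one byte per element, so a bit-packed mask would be needed to realize the full 1-bit-per-element saving in practice.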
Related papers
- TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading [13.283682311968752]
TBA is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed.
We show that TBA effectively reduces activation peak memory usage by 47%.
At the same time, TBA perfectly overlaps the I/O with the computation and incurs negligible performance overhead.
arXiv Detail & Related papers (2024-08-19T14:09:48Z) - Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR), in which we fine-tune only a fraction of the total blocks of a pre-trained model.
Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x.
arXiv Detail & Related papers (2024-03-25T08:41:01Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization.
We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z) - On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z) - LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer
Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that substantially reduces training memory requirements.
arXiv Detail & Related papers (2022-06-13T23:51:56Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - Memory Efficient 3D U-Net with Reversible Mobile Inverted Bottlenecks
for Brain Tumor Segmentation [4.134876686331775]
We propose combining memory-saving techniques with traditional U-Net architectures to increase the complexity of the models on the Brain Tumor Segmentation (BraTS) challenge.
Our 3D U-Net uses a reversible version of the mobile inverted bottleneck block to save activation memory during training.
We are able to train on image volumes up to 3x larger, models with 25% more depth, or models with up to 2x the number of channels compared to a corresponding non-reversible network.
arXiv Detail & Related papers (2021-04-19T21:23:55Z) - Layer Pruning via Fusible Residual Convolutional Block for Deep Neural
Networks [15.64167076052513]
Layer pruning achieves lower inference time and runtime memory usage when the same FLOPs and number of parameters are pruned.
We propose a simple layer pruning method using a residual convolutional block (ResConv).
Our pruning method achieves excellent compression and acceleration performance compared to the state of the art on different datasets.
arXiv Detail & Related papers (2020-11-29T12:51:16Z) - TinyTL: Reduce Activations, Not Trainable Parameters for Efficient
On-Device Learning [78.80707950262214]
On-device learning enables edge devices to continually adapt the AI models to new data.
Existing work solves this problem by reducing the number of trainable parameters.
We present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning.
arXiv Detail & Related papers (2020-07-22T18:39:53Z)
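
TinyTL's bias-only updates and MobileTL's shift-only, top-block fine-tuning share the same basic recipe: freeze the convolution weights and scales, and train only the bias/shift terms of the last few blocks plus the classifier head. Below is a rough sketch using torchvision's MobileNetV3; the module names `features` and `classifier` come from torchvision, while the block count and optimizer settings are illustrative assumptions, not the papers' configurations.

```python
# Rough sketch of bias/shift-only fine-tuning of the top blocks, in the spirit
# of TinyTL and MobileTL. Hyperparameters and the block split are illustrative.
import torch
from torchvision import models


def freeze_all_but_top_shifts(model: torch.nn.Module, num_blocks: int = 5):
    """Freeze every parameter, then re-enable only bias/shift terms in the
    top `num_blocks` feature blocks and the classifier head."""
    for p in model.parameters():
        p.requires_grad = False

    # In MobileNet-style models the convolutions have no bias, so the
    # parameters named "...bias" inside a block are the normalization shifts.
    for block in model.features[-num_blocks:]:
        for name, p in block.named_parameters():
            if name.endswith("bias"):
                p.requires_grad = True

    # The classifier is always trained on the new task.
    for p in model.classifier.parameters():
        p.requires_grad = True

    return [p for p in model.parameters() if p.requires_grad]


model = models.mobilenet_v3_large(weights="IMAGENET1K_V1")  # pretrained backbone
trainable = freeze_all_but_top_shifts(model, num_blocks=5)
optimizer = torch.optim.SGD(trainable, lr=0.05, momentum=0.9)
```

Since bias gradients do not require the layer inputs, and blocks below the unfrozen ones need no backward pass at all, most intermediate activation maps never have to be stored, which is where the memory saving comes from.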