SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using
Training Dynamics
- URL: http://arxiv.org/abs/2305.18513v1
- Date: Mon, 29 May 2023 17:50:52 GMT
- Title: SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using
Training Dynamics
- Authors: Arash Ardakani, Altan Haan, Shangyin Tan, Doru Thom Popovici, Alvin
Cheung, Costin Iancu, Koushik Sen
- Abstract summary: Transformer-based models, such as BERT and ViT, have achieved state-of-the-art results across different natural language processing (NLP) and computer vision (CV) tasks.
These models are extremely memory intensive during their fine-tuning process.
We introduce a new tool called SlimFit that reduces the memory requirements of these models by dynamically analyzing their training dynamics.
- Score: 16.94357817641467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models, such as BERT and ViT, have achieved
state-of-the-art results across different natural language processing (NLP) and
computer vision (CV) tasks. However, these models are extremely memory
intensive during their fine-tuning process, making them difficult to deploy on
GPUs with limited memory resources. To address this issue, we introduce a new
tool called SlimFit that reduces the memory requirements of these models by
dynamically analyzing their training dynamics and freezing less-contributory
layers during fine-tuning. The layers to freeze are chosen using a runtime
inter-layer scheduling algorithm. SlimFit adopts quantization and pruning for
particular layers to balance the load of dynamic activations and to minimize
the memory footprint of static activations, where static activations refer to
those that cannot be discarded regardless of freezing. This allows SlimFit to
freeze up to 95% of layers and reduce the overall on-device GPU memory usage of
transformer-based models such as ViT and BERT by an average of 2.2x, across
different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10,
CIFAR-100 and ImageNet with an average degradation of 0.2% in accuracy. For
such NLP and CV tasks, SlimFit can reduce up to 3.1x the total on-device memory
usage with an accuracy degradation of only up to 0.4%. As a result, while
fine-tuning of ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128
requires 3 and 2 32GB GPUs respectively, SlimFit enables their fine-tuning on a
single 32GB GPU without any significant accuracy degradation.
Related papers
- Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR) in which we fine-tune only a fraction of total blocks of a pre-trained model.
Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x.
arXiv Detail & Related papers (2024-03-25T08:41:01Z) - Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank
Compression Strategy [5.699098817569033]
This paper introduces an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs.
The presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset.
In addition to this, the presented compression technique can compress large DeiT/ViT models to have about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain.
arXiv Detail & Related papers (2024-02-08T19:01:14Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of
Language Model [92.55145016562867]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z) - On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z) - Efficient Fine-Tuning of BERT Models on the Edge [12.768368718187428]
We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models.
FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%.
More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
arXiv Detail & Related papers (2022-05-03T14:51:53Z) - Stochastic Backpropagation: A Memory Efficient Strategy for Training
Video Models [42.31924917984774]
We propose a memory efficient method, named Backpropagation (SBP), for training deep neural networks on videos.
Experiments show that SBP can be applied to a wide range of models for video tasks, leading to up to 80.0% GPU memory saving and 10% training speedup with less than 1% accuracy drop on action recognition and temporal action detection.
arXiv Detail & Related papers (2022-03-31T02:24:53Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - TinyTL: Reduce Activations, Not Trainable Parameters for Efficient
On-Device Learning [78.80707950262214]
On-device learning enables edge devices to continually adapt the AI models to new data.
Existing work solves this problem by reducing the number of trainable parameters.
We present Tiny-Transfer-Learning (TinyTL) for memory-efficient on-device learning.
arXiv Detail & Related papers (2020-07-22T18:39:53Z) - Highly Efficient Salient Object Detection with 100K Parameters [137.74898755102387]
We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features.
We build an extremely light-weighted model, namely CSNet, which achieves comparable performance with about 0.2% (100k) of large models on popular object detection benchmarks.
arXiv Detail & Related papers (2020-03-12T07:00:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.