SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using
Training Dynamics
- URL: http://arxiv.org/abs/2305.18513v1
- Date: Mon, 29 May 2023 17:50:52 GMT
- Title: SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using
Training Dynamics
- Authors: Arash Ardakani, Altan Haan, Shangyin Tan, Doru Thom Popovici, Alvin
Cheung, Costin Iancu, Koushik Sen
- Abstract summary: Transformer-based models, such as BERT and ViT, have achieved state-of-the-art results across different natural language processing (NLP) and computer vision (CV) tasks.
These models are extremely memory intensive during their fine-tuning process.
We introduce a new tool called SlimFit that reduces the memory requirements of these models by dynamically analyzing their training dynamics.
- Score: 16.94357817641467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models, such as BERT and ViT, have achieved
state-of-the-art results across different natural language processing (NLP) and
computer vision (CV) tasks. However, these models are extremely memory
intensive during their fine-tuning process, making them difficult to deploy on
GPUs with limited memory resources. To address this issue, we introduce a new
tool called SlimFit that reduces the memory requirements of these models by
dynamically analyzing their training dynamics and freezing less-contributory
layers during fine-tuning. The layers to freeze are chosen using a runtime
inter-layer scheduling algorithm. SlimFit adopts quantization and pruning for
particular layers to balance the load of dynamic activations and to minimize
the memory footprint of static activations, where static activations refer to
those that cannot be discarded regardless of freezing. This allows SlimFit to
freeze up to 95% of layers and reduce the overall on-device GPU memory usage of
transformer-based models such as ViT and BERT by an average of 2.2x, across
different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10,
CIFAR-100 and ImageNet with an average degradation of 0.2% in accuracy. For
such NLP and CV tasks, SlimFit can reduce the total on-device memory usage by up
to 3.1x with an accuracy degradation of only up to 0.4%. As a result, while
fine-tuning ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128
requires three and two 32GB GPUs respectively, SlimFit enables their fine-tuning on a
single 32GB GPU without any significant accuracy degradation.
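As a rough illustration of the freezing mechanism described above, the sketch below scores each top-level layer by how much its parameters moved over a training interval and freezes the least-changed fraction. The scoring signal, the 95% ratio, and the helper names are stand-ins; SlimFit's runtime inter-layer scheduling and its quantization/pruning of activations are not reproduced here.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of training-dynamics-based freezing: score each
# top-level layer by how much its parameters changed since the last
# snapshot and freeze the least-changed fraction. SlimFit's actual
# scheduling algorithm and scoring signal are more involved than this.

def snapshot(model):
    return {n: p.detach().clone() for n, p in model.named_parameters()}

def freeze_least_contributory(model, prev, freeze_ratio=0.95):
    scores = {}
    for name, child in model.named_children():
        scores[name] = sum((p - prev[f"{name}.{pn}"]).abs().sum().item()
                           for pn, p in child.named_parameters())
    ranked = sorted(scores, key=scores.get)           # least-changed first
    frozen = ranked[:int(len(ranked) * freeze_ratio)]
    for name in frozen:
        for p in model.get_submodule(name).parameters():
            p.requires_grad_(False)  # no grads or optimizer updates here on
    return frozen

# usage sketch: prev = snapshot(model); ...train one interval...
# frozen = freeze_least_contributory(model, prev); prev = snapshot(model)
```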
Related papers
- FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models [15.244129138320782]
FlashSVD is an end-to-end rank-aware streaming inference framework for SVD-compressed large language models.
It cuts peak activation memory by up to 70.2% and intermediate transient memory by 75%.
It incurs no accuracy loss with upstream encoder compression methods, offering a practical path toward memory-constrained deployment of low-rank LLMs.
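A toy sketch of what rank-aware streaming can look like for an SVD-factorized feed-forward layer: the hidden activation is produced and consumed one tile at a time, so the full (batch, d_hidden) buffer never exists at once. The tiling and function names are illustrative assumptions, not FlashSVD's actual kernels.

```python
import numpy as np

# Toy streaming low-rank FFN: gelu(x @ U1 @ V1) @ W2, computed in tiles
# over the hidden dimension so the (B, d_hidden) activation is never
# fully materialized. Illustrative only.

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def lowrank_ffn_streamed(x, U1, V1, W2, tile=256):
    """x: (B, d); U1: (d, r); V1: (r, d_hidden); W2: (d_hidden, d)."""
    z = x @ U1                                 # small rank-r intermediate
    y = np.zeros((x.shape[0], W2.shape[1]), dtype=x.dtype)
    for j in range(0, V1.shape[1], tile):
        h = gelu(z @ V1[:, j:j + tile])        # one (B, tile) hidden slice
        y += h @ W2[j:j + tile, :]             # consumed immediately, then freed
    return y
```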
arXiv Detail & Related papers (2025-08-02T22:06:46Z)
- 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [71.43026659686679]
Large Language Models (LLMs) have grown rapidly in size, creating challenges for efficient deployment on resource-constrained hardware.
We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
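As a hypothetical sketch of how a roughly 30% lossless reduction of BF16 weights is possible at all: the 8-bit exponent field of trained weights is highly skewed, so entropy-coding it (keeping sign and mantissa verbatim) shrinks storage while decoding back to identical bits. Whether this matches DFloat11's exact format is not claimed here.

```python
import heapq
import numpy as np
from collections import Counter

# Toy estimate of bits/weight if BF16 exponents are Huffman-coded and the
# sign + 7 mantissa bits are stored verbatim. The real DFloat11 format and
# its GPU decompression kernels are far more involved.

def huffman_lengths(freqs):
    """symbol -> optimal prefix-code length, via Huffman merging."""
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, uid,
                              {s: l + 1 for s, l in {**c1, **c2}.items()}))
        uid += 1
    return heap[0][2]

w = np.random.randn(1_000_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)    # truncate fp32 -> bf16 bits
exp = Counter(((bf16 >> 7) & 0xFF).tolist())          # 8-bit exponent field
lengths = huffman_lengths(exp)
exp_bits = sum(lengths[s] * f for s, f in exp.items()) / len(w)
print(f"~{exp_bits + 8:.2f} bits/weight (1 sign + 7 mantissa + coded exponent)")
```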
arXiv Detail & Related papers (2025-04-15T22:38:38Z)
- Dynamic Gradient Sparse Update for Edge Training [0.0502254944841629]
Gradient computation for backpropagation during training requires significant memory buffers to store intermediate features and compute losses.
This is unacceptable for memory-constrained edge devices such as microcontrollers.
We propose a training acceleration method using dynamic gradient sparse updates to reduce memory usage.
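A minimal sketch of a dynamic sparse update under these constraints: after backward, keep only the largest-magnitude gradient entries and zero the rest, so the optimizer updates a small, step-dependent subset of weights. The selection rule and keep ratio are assumptions; real edge-training systems select before backward to also save activation memory.

```python
import torch

# Post-hoc top-k gradient masking: only illustrates the selection rule,
# not the paper's full memory-saving pipeline.

def sparsify_grads(model, keep_ratio=0.05):
    for p in model.parameters():
        if p.grad is None:
            continue
        k = max(1, int(keep_ratio * p.grad.numel()))
        thresh = p.grad.abs().flatten().topk(k).values.min()
        p.grad.mul_((p.grad.abs() >= thresh).to(p.grad.dtype))

# usage: loss.backward(); sparsify_grads(model); optimizer.step()
```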
arXiv Detail & Related papers (2025-03-23T06:32:12Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
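A minimal sketch of the plug-in low-rank step the summary describes, applied to one key-projection matrix via truncated SVD so the per-token cache can live in an r-dimensional latent space. LoRC's progressive, sensitivity-aware rank allocation across layers is omitted.

```python
import numpy as np

# Truncated-SVD factorization of a KV projection weight: cache the small
# latent K instead of the full keys and expand on use. Illustrative only.

def lowrank_factor(W, r):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]          # (d_in, r)
    B = Vt[:r, :]                 # (r, d_out)
    return A, B                   # W ~= A @ B

d_model, d_kv, r = 768, 768, 128
W_k = np.random.randn(d_model, d_kv) / np.sqrt(d_model)
A, B = lowrank_factor(W_k, r)
x = np.random.randn(10, d_model)         # 10 token hidden states
latent_k = x @ A                         # cache this: (10, r), not (10, d_kv)
k = latent_k @ B                         # expand when attention needs full keys
print(np.linalg.norm(x @ W_k - k) / np.linalg.norm(x @ W_k))
```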
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
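A toy version of query-driven channel pruning for the key cache: score each channel by its contribution to query-key dot products and drop the weakest. The scoring rule and keep ratio below are illustrative assumptions, not ThinK's exact criterion.

```python
import numpy as np

# Prune low-impact key channels using recent queries as the signal.
# Illustrative stand-in for a query-dependent pruning criterion.

def prune_key_channels(K, Q, keep_ratio=0.6):
    """K: (seq, d) cached keys; Q: (m, d) recent queries."""
    scores = np.linalg.norm(Q, axis=0) * np.linalg.norm(K, axis=0)
    keep = np.sort(np.argsort(scores)[-int(keep_ratio * K.shape[1]):])
    return K[:, keep], keep

K, Q = np.random.randn(4096, 128), np.random.randn(32, 128)
K_small, keep = prune_key_channels(K, Q)
attn_logits = Q[:, keep] @ K_small.T   # queries restricted to kept channels
```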
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR) in which we fine-tune only a fraction of total blocks of a pre-trained model.
Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x.
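A minimal sketch of fine-tuning only a fraction of blocks: freeze everything, then re-enable gradients for a chosen subset. How BSR picks which blocks to reprogram is not modeled here; the selection below is arbitrary.

```python
import torch.nn as nn

# Freeze all transformer blocks, then mark a chosen subset trainable.

def select_blocks_for_training(blocks, train_ids):
    for i, blk in enumerate(blocks):
        for p in blk.parameters():
            p.requires_grad_(i in train_ids)

encoder = nn.Sequential(*[
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
    for _ in range(12)
])
select_blocks_for_training(encoder, train_ids={9, 10, 11})  # e.g. last 3 of 12
trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")
```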
arXiv Detail & Related papers (2024-03-25T08:41:01Z)
- Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank
Compression Strategy [5.699098817569033]
This paper introduces an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs.
The presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset.
In addition to this, the presented compression technique can compress large DeiT/ViT models to have about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain.
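A simplified stand-in for an activation-aware factorization: scale the weight rows by calibration activation magnitudes before the SVD, so strongly firing input channels are approximated more faithfully, then fold the scaling back in. The paper's per-layer mixed-rank selection is not reproduced.

```python
import numpy as np

# Activation-weighted low-rank approximation: W ~= S^-1 @ SVD_r(S @ W),
# with S diagonal from calibration activations. Simplified sketch only.

def activation_aware_lowrank(W, acts, r, eps=1e-6):
    """W: (d_in, d_out); acts: (n, d_in) calibration inputs."""
    s = np.abs(acts).mean(0) + eps              # per-input-channel scale
    U, sv, Vt = np.linalg.svd(s[:, None] * W, full_matrices=False)
    A = (U[:, :r] * sv[:r]) / s[:, None]        # fold S^-1 back into A
    B = Vt[:r, :]
    return A, B                                 # W ~= A @ B
```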
arXiv Detail & Related papers (2024-02-08T19:01:14Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
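A sketch of the plain column-row sampling (CRS) estimator this work builds on: sample k inner indices proportionally to column/row norms and rescale for unbiasedness. The winner-take-all refinement, which keeps the highest-probability pairs deterministically, is omitted.

```python
import numpy as np

# Unbiased CRS estimate of A @ B from k sampled column-row pairs.

def crs_matmul(A, B, k, seed=0):
    rng = np.random.default_rng(seed)
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p /= p.sum()
    idx = rng.choice(A.shape[1], size=k, p=p)        # sampled inner indices
    return (A[:, idx] / (k * p[idx])) @ B[idx, :]    # rescale -> unbiased

A, B = np.random.randn(64, 1024), np.random.randn(1024, 64)
err = np.linalg.norm(crs_matmul(A, B, 256) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error with 256/1024 sampled pairs: {err:.3f}")
```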
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
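A hypothetical single training step in the spirit of a compression-aware minimizer: take the gradient at a magnitude-pruned copy of the weights and apply it to the dense weights, encouraging parameters that remain accurate after pruning. The perturbation form and sparsity schedule are simplified relative to CrAM itself.

```python
import torch

# One compression-aware step: gradient at the pruned point, update on the
# dense weights. Simplified sketch, not CrAM's exact update.

def cram_like_step(model, loss_fn, batch, optimizer, sparsity=0.5):
    dense = [p.detach().clone() for p in model.parameters()]  # stash dense
    with torch.no_grad():
        for p in model.parameters():                          # prune in place
            k = int(sparsity * p.numel())
            if k > 0:
                thresh = p.abs().flatten().kthvalue(k).values
                p.mul_((p.abs() > thresh).to(p.dtype))
    optimizer.zero_grad()
    loss_fn(model, batch).backward()               # grad at compressed point
    with torch.no_grad():
        for p, d in zip(model.parameters(), dense):
            p.copy_(d)                             # restore dense weights
    optimizer.step()                               # step with that gradient
```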
arXiv Detail & Related papers (2022-07-28T16:13:28Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Efficient Fine-Tuning of BERT Models on the Edge [12.768368718187428]
We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models.
FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%.
More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.
arXiv Detail & Related papers (2022-05-03T14:51:53Z)
- Stochastic Backpropagation: A Memory Efficient Strategy for Training
Video Models [42.31924917984774]
We propose a memory efficient method, named Stochastic Backpropagation (SBP), for training deep neural networks on videos.
Experiments show that SBP can be applied to a wide range of models for video tasks, leading to up to 80.0% GPU memory saving and 10% training speedup with less than 1% accuracy drop on action recognition and temporal action detection.
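A toy rendering of the idea: run the forward pass over all frames but let gradients flow back through only a random subset, so the other frames' intermediate activations can be freed. Which layers and tensors SBP actually subsamples is more carefully designed than this sketch.

```python
import torch

# Per-frame features with gradients kept for only ~keep_prob of the frames.

def features_with_partial_backprop(frame_encoder, frames, keep_prob=0.2):
    """frames: (T, C, H, W); gradients flow through ~keep_prob of frames."""
    feats = []
    for t in range(frames.shape[0]):
        f = frame_encoder(frames[t:t + 1])
        if torch.rand(()) > keep_prob:
            f = f.detach()          # this frame's activation graph is dropped
        feats.append(f)
    return torch.cat(feats, dim=0)  # pooled / classified downstream as usual
```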
arXiv Detail & Related papers (2022-03-31T02:24:53Z)
- Highly Efficient Salient Object Detection with 100K Parameters [137.74898755102387]
We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features.
We build an extremely light-weight model, namely CSNet, which achieves comparable performance with about 0.2% of the parameters (100k) of large models on popular salient object detection benchmarks.
arXiv Detail & Related papers (2020-03-12T07:00:46Z)