Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction
- URL: http://arxiv.org/abs/2210.10246v1
- Date: Wed, 19 Oct 2022 01:59:37 GMT
- Title: Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction
- Authors: Muralidhar Andoorveedu, Zhanda Zhu, Bojian Zheng, Gennady Pekhimenko
- Abstract summary: We propose Tempo, a new approach to efficiently use accelerator memory resources for training Transformer-based models.
Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage.
We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline.
- Score: 3.5831119917067737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training deep learning models can be computationally expensive. Prior works
have shown that increasing the batch size can potentially lead to better
overall throughput. However, the batch size is frequently limited by the
accelerator memory capacity due to the activations/feature maps stored for the
training backward pass, as larger batch sizes require larger feature maps to be
stored. Transformer-based models, which have recently seen a surge in
popularity due to their good performance and applicability to a variety of
tasks, have a similar problem. To remedy this issue, we propose Tempo, a new
approach to efficiently use accelerator (e.g., GPU) memory resources for
training Transformer-based models. Our approach provides drop-in replacements
for the GELU, LayerNorm, and Attention layers, reducing the memory usage and
ultimately leading to more efficient training. We implement Tempo and evaluate
the throughput, memory usage, and accuracy/loss on the BERT Large pre-training
task. We demonstrate that Tempo enables up to 2x higher batch sizes and 16%
higher training throughput over the state-of-the-art baseline. We also evaluate
Tempo on GPT2 and RoBERTa models, showing speedups of 19% and 26% over the
baseline, respectively.
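The abstract describes drop-in, memory-reducing replacements for the GELU, LayerNorm, and Attention layers but does not spell out the kernels here, so the following is only a minimal sketch (assuming PyTorch) of the same memory-for-compute trade-off, not Tempo's actual method: the FFN sub-layer around GELU is checkpointed so its large intermediate activation is recomputed during the backward pass instead of being stored, one generic way to free activation memory for larger batch sizes.

```python
# Illustrative sketch only, not Tempo's kernels: checkpoint the FFN sub-layer
# (Linear -> GELU -> Linear) so its 4x-hidden intermediate is recomputed in the
# backward pass instead of being kept alive in accelerator memory.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class MemorySavingFFN(nn.Module):
    def __init__(self, hidden: int, expansion: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(hidden, expansion * hidden)
        self.fc2 = nn.Linear(expansion * hidden, hidden)

    def _ffn(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only x is saved for backward; fc1's output and the GELU result are recomputed.
        return checkpoint(self._ffn, x, use_reentrant=False)

ffn = MemorySavingFFN(hidden=1024)
out = ffn(torch.randn(8, 128, 1024, requires_grad=True))
out.sum().backward()
```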
Related papers
- TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading [13.283682311968752]
TBA is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed.
We show that TBA effectively reduces the peak activation memory usage by 47%.
At the same time, TBA perfectly overlaps the I/O with the computation and incurs negligible performance overhead.
arXiv Detail & Related papers (2024-08-19T14:09:48Z)
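As a rough analogue of the activation-offloading idea in the TBA entry above: TBA streams activations to SSDs and overlaps the I/O with compute, whereas the sketch below only offloads saved activations to pinned CPU memory using PyTorch's built-in saved-tensor hook and assumes a CUDA device.

```python
# Minimal CPU-offload sketch (not TBA itself): tensors saved for the backward pass
# are kept in pinned CPU memory instead of GPU memory, shrinking the peak GPU footprint.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()   # activations saved for backward live on the CPU
loss.backward()             # and are copied back to the GPU as backward needs them
```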
- Block Selective Reprogramming for On-device Training of Vision Transformers [12.118303034660531]
We present block selective reprogramming (BSR), in which we fine-tune only a fraction of the total blocks of a pre-trained model.
Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x.
arXiv Detail & Related papers (2024-03-25T08:41:01Z)
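A hedged sketch of the general block-freezing recipe behind the BSR entry above (assuming PyTorch and torchvision); the paper's actual block-selection rule is not given in the summary, so fine-tuning the last k blocks here is purely illustrative.

```python
# Freeze a ViT and train only the last k transformer blocks, so the frozen prefix
# keeps no gradients or optimizer state and stores no activations for backward.
import torch
from torchvision.models import vit_b_16

model = vit_b_16(weights=None)          # pre-trained weights would be loaded in practice
for p in model.parameters():
    p.requires_grad = False

k = 2                                    # number of blocks to adapt (assumption)
for block in list(model.encoder.layers)[-k:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```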
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
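A rough sketch of gradient low-rank projection as described in the GaLore entry above, applied to a single weight matrix; the rank, learning rate, and the fact that a real implementation keeps Adam state in the low-rank subspace and refreshes the projector periodically are assumptions not shown here.

```python
# One low-rank gradient step: project the gradient onto its top singular directions,
# let the (small) projected gradient stand in for optimizer state, then project back.
import torch

def low_rank_grad_step(weight: torch.Tensor, grad: torch.Tensor,
                       rank: int = 4, lr: float = 1e-3) -> None:
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                      # (m, rank) projector from the gradient
    low_rank_grad = P.T @ grad           # (rank, n): what an optimizer would track
    weight -= lr * (P @ low_rank_grad)   # project the update back to full size

W, G = torch.randn(256, 128), torch.randn(256, 128)
low_rank_grad_step(W, G)
```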
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
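A minimal sketch of the frozen-backbone pattern from the entry above: features are computed under no_grad, so nothing is backpropagated (or stored) for the backbone and only a small head is trained. The paper's specific lightweight parallel architecture is not reproduced; the tiny backbone and adapter below are placeholders.

```python
# Train a small adapter on features from a frozen backbone; the backbone builds no
# autograd graph, so its activations and gradients never occupy memory.
import torch
from torch import nn

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
backbone.requires_grad_(False).eval()

adapter = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)

x = torch.randn(4, 3, 224, 224)
with torch.no_grad():                # nothing is saved for the backbone's backward
    feats = backbone(x)
loss = adapter(feats).sum()          # gradients flow through the adapter only
loss.backward()
optimizer.step()
```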
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
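For background on the entry above, a sketch of the classic unbiased column-row sampling estimator of a matrix product; WTA-CRS refines this with a winner-take-all selection rule that is not reproduced here.

```python
# Sample column i of A and row i of B with probability ~ ||A[:, i]|| * ||B[i, :]||
# and reweight by 1 / (samples * p_i); the expectation of the sum equals A @ B.
import torch

def crs_matmul(A: torch.Tensor, B: torch.Tensor, samples: int) -> torch.Tensor:
    p = A.norm(dim=0) * B.norm(dim=1)
    p = p / p.sum()
    idx = torch.multinomial(p, samples, replacement=True)
    scale = 1.0 / (samples * p[idx])              # importance weights
    return (A[:, idx] * scale) @ B[idx, :]

A, B = torch.randn(64, 512), torch.randn(512, 32)
rel_err = (crs_matmul(A, B, samples=256) - A @ B).norm() / (A @ B).norm()
```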
- Adaptive Cross Batch Normalization for Metric Learning [75.91093210956116]
Metric learning is a fundamental problem in computer vision.
We show that it is equally important to ensure that the accumulated embeddings are up to date.
In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration.
arXiv Detail & Related papers (2023-03-30T03:22:52Z)
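To make the "accumulated embeddings" in the entry above concrete, here is a sketch of a plain cross-batch embedding memory (a FIFO queue of past embeddings and labels); the paper's adaptive correction for representational drift is not shown.

```python
# A fixed-size queue of detached embeddings from earlier batches; older entries drift
# away from the current encoder, which is the problem the entry above addresses.
import torch

class CrossBatchMemory:
    def __init__(self, size: int, dim: int):
        self.feats = torch.zeros(size, dim)
        self.labels = -torch.ones(size, dtype=torch.long)   # -1 marks empty slots
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        n = feats.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.feats.shape[0]
        self.feats[idx] = feats.detach().cpu()
        self.labels[idx] = labels.cpu()
        self.ptr = (self.ptr + n) % self.feats.shape[0]

    def valid(self):
        mask = self.labels >= 0                              # slots filled so far
        return self.feats[mask], self.labels[mask]

memory = CrossBatchMemory(size=4096, dim=128)
memory.enqueue(torch.randn(32, 128), torch.randint(0, 10, (32,)))
```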
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
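A toy version of the store-low-precision idea from the Mesa entry above (Mesa's actual quantization and framework integration are more involved): the forward result is exact, while the tensor saved for backward is kept in fp16, halving that activation's memory at the cost of a slightly approximate gradient.

```python
# GELU whose forward output is exact but whose saved-for-backward input is fp16.
import torch

class HalfSavedGELU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(x.to(torch.float16))    # compressed copy for backward
        return torch.nn.functional.gelu(x)            # exact forward result

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        (x_fp16,) = ctx.saved_tensors
        with torch.enable_grad():
            x = x_fp16.float().requires_grad_()       # slightly lossy reconstruction
            y = torch.nn.functional.gelu(x)
            (grad_x,) = torch.autograd.grad(y, x, grad_out)
        return grad_x

inp = torch.randn(8, 1024, requires_grad=True)
HalfSavedGELU.apply(inp).sum().backward()
```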
- Improving Computational Efficiency in Visual Reinforcement Learning via Stored Embeddings [89.63764845984076]
We present Stored Embeddings for Efficient Reinforcement Learning (SEER), a simple modification of existing off-policy deep reinforcement learning methods.
We show that SEER does not degrade the performance of RL agents while significantly saving computation and memory.
arXiv Detail & Related papers (2021-03-04T08:14:10Z)
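A bare-bones illustration of the stored-embeddings idea from the SEER entry above: once the convolutional encoder is frozen, each observation is encoded once and only the small latent vector is kept for replay instead of raw frames. SEER's freezing schedule and RL training loop are omitted, and the encoder below is a placeholder.

```python
# Store latent vectors (here 50-d) in the replay buffer rather than 3x84x84 frames.
import torch
from torch import nn

encoder = nn.Sequential(nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
                        nn.Flatten(), nn.Linear(32 * 20 * 20, 50))
encoder.requires_grad_(False).eval()        # frozen after the early phase of training

replay_buffer = []                           # holds embeddings, not images

@torch.no_grad()
def add_transition(obs: torch.Tensor, action: int, reward: float) -> None:
    z = encoder(obs.unsqueeze(0)).squeeze(0)     # encode once at storage time
    replay_buffer.append((z, action, reward))

add_transition(torch.randn(3, 84, 84), action=1, reward=0.5)
```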
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
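A sketch of how the SliceOut entry above differs from standard dropout: instead of zeroing random units, a contiguous slice of the hidden dimension is kept, so both matrix multiplications really are smaller. SliceOut's rescaling and the implicit test-time ensembling are not reproduced; the two-layer MLP is an illustrative assumption.

```python
# Keep a random contiguous slice of the hidden units during training, giving genuinely
# smaller dense matmuls; at test time the full layer is used ("SliceOut turned off").
import torch
from torch import nn

class SliceOutMLP(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int, keep: float = 0.5):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
        self.keep = keep

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return self.fc2(torch.relu(self.fc1(x)))
        width = int(self.fc1.out_features * self.keep)
        s = torch.randint(0, self.fc1.out_features - width + 1, (1,)).item()
        h = torch.relu(x @ self.fc1.weight[s:s + width].T + self.fc1.bias[s:s + width])
        return h @ self.fc2.weight[:, s:s + width].T + self.fc2.bias

mlp = SliceOutMLP(64, 256, 10).train()
out = mlp(torch.randn(32, 64))
```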
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.