Reducing Activation Recomputation in Large Transformer Models
- URL: http://arxiv.org/abs/2205.05198v1
- Date: Tue, 10 May 2022 22:40:17 GMT
- Title: Reducing Activation Recomputation in Large Transformer Models
- Authors: Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael
Andersch, Mohammad Shoeybi, Bryan Catanzaro
- Abstract summary: We show how to significantly accelerate training of large transformer models by reducing activation recomputation.
We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation.
Our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%.
- Score: 17.810669621463962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large transformer models is one of the most important computational
challenges of modern AI. In this paper, we show how to significantly accelerate
training of large transformer models by reducing activation recomputation.
Activation recomputation is commonly used to work around memory capacity
constraints. Rather than storing activations for backpropagation, they are
traditionally recomputed, which saves memory but adds redundant compute. In
this work, we show most of this redundant compute is unnecessary because we can
reduce memory consumption sufficiently without it. We present two novel yet
very simple techniques: sequence parallelism and selective activation
recomputation. In conjunction with tensor parallelism, these techniques almost
eliminate the need to recompute activations. We evaluate our approach on
language models up to one trillion parameters in scale and show that our method
reduces activation memory by 5x, while reducing execution time overhead from
activation recomputation by over 90%. For example, when training a 530B
parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops
Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using
recomputation. Our implementation will be available in both Megatron-LM and
NeMo-Megatron.
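Selective activation recomputation rests on the observation that the attention score, softmax, and attention-dropout buffers grow quadratically with sequence length while accounting for only a small fraction of the layer's FLOPs, so checkpointing just that slice of each transformer layer recovers most of the memory savings at little recompute cost. Below is a minimal PyTorch sketch of that idea using standard gradient checkpointing around only the attention core; the module name, layer sizes, and the use of torch.utils.checkpoint here are illustrative assumptions, not the Megatron-LM/NeMo-Megatron implementation.
```python
# Illustrative sketch of selective activation recomputation (NOT the Megatron-LM
# implementation): only the attention core is checkpointed, so its large
# score/softmax/dropout activations are freed after the forward pass and
# recomputed on demand during backward, while the rest of the layer keeps its
# activations stored as usual.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class SelectiveRecomputeAttention(nn.Module):
    """Self-attention whose internal activations are recomputed in backward."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        # hidden_size and num_heads below are hypothetical example values.
        self.attn = nn.MultiheadAttention(hidden_size, num_heads)

    def _attn_core(self, x):
        # Everything created inside this function (attention scores, softmax,
        # dropout mask) is discarded after forward and rebuilt during backward.
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

    def forward(self, x):
        # use_reentrant=False selects the non-reentrant checkpoint mode in recent PyTorch.
        return checkpoint(self._attn_core, x, use_reentrant=False)


if __name__ == "__main__":
    layer = SelectiveRecomputeAttention(hidden_size=1024, num_heads=16)
    x = torch.randn(2048, 4, 1024, requires_grad=True)  # (seq_len, batch, hidden)
    layer(x).sum().backward()  # attention internals are recomputed here
```
Sequence parallelism, the paper's second technique, complements this by sharding the layer-norm and dropout activations (the parts tensor parallelism leaves replicated) along the sequence dimension across tensor-parallel ranks, which is what allows the combination to almost eliminate recomputation.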
Related papers
- VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks.
We identify and characterise the important components needed for effective model convergence using gradient descent.
This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
arXiv Detail & Related papers (2024-05-28T09:23:14Z)
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing [10.47214968497857]
We present high-performance methods that exploit low-rank structures to pretrain and finetune large language models.
Our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretraining without an accuracy drop.
For finetuning, our methods achieve average accuracy increases of 6.3% and 24.0% on general tasks and financial tasks, respectively.
arXiv Detail & Related papers (2024-02-21T05:03:17Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient to train.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing only a low-precision copy of the activations for the backward pass, reducing memory consumption during training (a minimal sketch of this pattern appears after this list).
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Improving compute efficacy frontiers with SliceOut [31.864949424541344]
We introduce SliceOut -- a dropout-inspired scheme to train deep learning models faster without impacting final test accuracy.
At test time, turning off SliceOut performs an implicit ensembling across a linear number of architectures that preserves test accuracy.
This leads to faster processing of large computational workloads overall, and significantly reduces the resulting energy consumption and CO2 emissions.
arXiv Detail & Related papers (2020-07-21T15:59:09Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
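As noted in the Mesa entry above, the forward-exact / store-compressed pattern can be expressed with a custom autograd function: compute the forward pass with full-precision activations, but save only a reduced-precision copy for the gradient computation. The sketch below is a hedged illustration of that general idea for a single linear layer, assuming a plain fp16 copy; it is not Mesa's actual quantization scheme or code.
```python
# Hedged illustration of low-precision activation storage (not Mesa's code):
# the forward matmul uses the exact input, but only a half-precision copy of
# the input is kept for computing the weight gradient in backward.
import torch


class HalfStoredLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        out = x @ weight.t()                     # exact forward computation
        ctx.save_for_backward(x.half(), weight)  # store compressed activation
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_half, weight = ctx.saved_tensors
        grad_x = grad_out @ weight               # exact input gradient
        grad_w = grad_out.t() @ x_half.float()   # approximate weight gradient
        return grad_x, grad_w


if __name__ == "__main__":
    x = torch.randn(8, 512, requires_grad=True)
    w = torch.randn(256, 512, requires_grad=True)
    HalfStoredLinear.apply(x, w).sum().backward()
    print(x.grad.shape, w.grad.shape)  # torch.Size([8, 512]) torch.Size([256, 512])
```
Saving x.half() instead of x halves this layer's stored activation memory at the cost of a slightly approximate weight gradient, which is the trade-off the Mesa summary describes.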
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.