TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
- URL: http://arxiv.org/abs/2408.10013v1
- Date: Mon, 19 Aug 2024 14:09:48 GMT
- Title: TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading
- Authors: Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu
- Abstract summary: TBA is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed.
We show that TBA reduces activation peak memory usage by 47%.
At the same time, TBA perfectly overlaps the I/O with the computation and incurs negligible performance overhead.
- Score: 13.283682311968752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPU memory capacity has not grown as fast as the size of large language models (LLMs), which hinders model training. In particular, activations -- the intermediate tensors produced during forward propagation and reused in backward propagation -- dominate GPU memory use. To address this challenge, we propose TBA to efficiently offload activations to high-capacity NVMe SSDs. This approach reduces GPU memory usage without impacting performance by adaptively overlapping data transfers with computation. TBA is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication, forwarding, and adaptive offloading to further enhance efficiency. We conduct extensive experiments on GPT, BERT, and T5. Results demonstrate that TBA reduces activation peak memory usage by 47%. At the same time, TBA perfectly overlaps the I/O with the computation and incurs negligible performance overhead. We introduce the recompute-offload-keep (ROK) curve to compare TBA offloading with two other tensor placement strategies: keeping activations in memory and layerwise full recomputation. We find that TBA achieves better memory savings than layerwise full recomputation while retaining the performance of keeping the activations in memory.
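The offloading idea in the abstract (save activations during the forward pass, write them to storage while computation continues, and read them back for the backward pass) can be illustrated with a toy sketch. This is not TBA's implementation. It assumes Python floats in place of GPU tensors, a temporary directory in place of an NVMe SSD, a single background thread for I/O, and a simple name-based form of deduplication. The names `ActivationOffloader` and `train_step` are hypothetical and do not come from the paper.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


class ActivationOffloader:
    """Toy offloader: writes saved activations to files on a background thread."""

    def __init__(self, directory):
        self.directory = directory
        self.pool = ThreadPoolExecutor(max_workers=1)  # background I/O worker
        self.pending = {}  # name -> Future of the in-flight write
        self.paths = {}    # name -> file path
        self.writes = 0    # number of writes actually issued

    @staticmethod
    def _write(path, activation):
        with open(path, "wb") as f:
            pickle.dump(activation, f)

    def save(self, name, activation):
        """Start offloading `activation` under `name` without blocking the caller."""
        if name in self.paths:
            return  # simplified deduplication: this name was already offloaded
        path = os.path.join(self.directory, name + ".bin")
        self.paths[name] = path
        self.pending[name] = self.pool.submit(self._write, path, activation)
        self.writes += 1

    def load(self, name):
        """Wait for the write to finish, then read the activation back."""
        self.pending[name].result()
        with open(self.paths[name], "rb") as f:
            return pickle.load(f)

    def close(self):
        self.pool.shutdown(wait=True)


def train_step(weights, x, offloader):
    """Forward and backward pass of a chain of scalar layers y = w * x.

    The loss is the final output, so dL/dw_i = (product of later weights) * x_i,
    where x_i is the layer input that was offloaded during the forward pass.
    """
    for i, w in enumerate(weights):
        offloader.save(f"layer{i}", x)  # the write proceeds in the background
        x = w * x                       # the layer's compute overlaps with it
    grad_out = 1.0
    grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        saved_input = offloader.load(f"layer{i}")
        grads[i] = grad_out * saved_input
        grad_out *= weights[i]
    return x, grads


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        offloader = ActivationOffloader(directory)  # use a fresh one per step
        output, grads = train_step([2.0, 3.0, 4.0], 1.5, offloader)
        offloader.close()
        print(output, grads)
```

In this sketch the forward loop never waits on storage; only `load` blocks, and only if the corresponding write has not yet finished. A real system would additionally have to free the GPU copy once the write completes and prefetch activations before the backward pass needs them.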
Related papers
- Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications.
We propose MEMO, a novel framework for fine-grained activation memory management.
We show that MEMO achieves an average of 2.42x and 2.26x MFU compared to Megatron-LM and DeepSpeed.
arXiv Detail & Related papers (2024-07-16T18:59:49Z) - Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation [29.139579820699495]
This work strives to reduce memory overhead in fine-tuning from the perspectives of activation functions and layer normalization.
We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions.
In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers.
arXiv Detail & Related papers (2024-06-24T03:09:15Z) - Contractive error feedback for gradient compression [60.05809370598166]
We propose a communication-efficient method called contractive error feedback (ConEF).
As opposed to SGD with error-feedback (EFSGD) that inefficiently manages memory, ConEF obtains the sweet spot of convergence and memory usage.
We empirically validate ConEF on various learning tasks that include image classification, language modeling, and machine translation.
arXiv Detail & Related papers (2023-12-13T21:54:21Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - MobileTL: On-device Transfer Learning with Inverted Residual Blocks [14.305834934988185]
We present MobileTL, a transfer learning method for models built with Inverted Residual Blocks (IRBs).
MobileTL trains the shifts for internal normalization layers to avoid storing activation maps for the backward pass.
Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively.
arXiv Detail & Related papers (2022-12-05T23:07:55Z) - Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction [3.5831119917067737]
We propose Tempo, a new approach to efficiently use accelerator memory resources for training Transformer-based models.
Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage.
We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline.
arXiv Detail & Related papers (2022-10-19T01:59:37Z) - DIVISION: Memory Efficient Training via Dual Activation Precision [60.153754740511864]
State-of-the-art work combines a search of quantization bit-width with the training, which makes the procedure complicated and less transparent.
We propose a simple and effective method to compress DNN training.
Experiment results show DIVISION has better comprehensive performance than state-of-the-art methods, including over 10x compression of activation maps and competitive training throughput, without loss of model accuracy.
arXiv Detail & Related papers (2022-08-05T03:15:28Z) - Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z) - Improving Computational Efficiency in Visual Reinforcement Learning via Stored Embeddings [89.63764845984076]
We present Stored Embeddings for Efficient Reinforcement Learning (SEER).
SEER is a simple modification of existing off-policy deep reinforcement learning methods.
We show that SEER does not degrade the performance of RL agents while significantly saving computation and memory.
arXiv Detail & Related papers (2021-03-04T08:14:10Z) - SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to reliance on external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.