Related papers: SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training

SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training

URL: http://arxiv.org/abs/2408.10013v2
Date: Sat, 15 Feb 2025 22:39:05 GMT
Title: SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training
Authors: Kun Wu, Jeongmin Brian Park, Xiaofan Zhang, Mert Hidayetoğlu, Vikram Sharma Mailthody, Sitao Huang, Steven Sam Lumetta, Wen-mei Hwu,
Abstract summary: SSDTrain is an adaptive activation framework offloading to high-capacity GPU memory.<n>It is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed.<n>Results demonstrate that SSDTrain reduces 47% of the activation peak memory usage.
Score: 13.283682311968752
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The growth rate of the GPU memory capacity has not been able to keep up with that of the size of large language models (LLMs), hindering the model training process. In particular, activations -- the intermediate tensors produced during forward propagation and reused in backward propagation -- dominate the GPU memory use. This leads to high training overhead such as high weight update cost due to the small micro-batch size. To address this challenge, we propose SSDTrain, an adaptive activation offloading framework to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further enhance efficiency. We extensively experimented with popular LLMs like GPT, BERT, and T5. Results demonstrate that SSDTrain reduces 47% of the activation peak memory usage. Meanwhile, SSDTrain perfectly overlaps the I/O with the computation and incurs negligible overhead. Compared with keeping activations in GPU memory and layerwise full recomputation, SSDTrain achieves the best memory savings with negligible throughput loss. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles.

Related papers

Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training [9.775731832789116]
We introduce STWeaver, a GPU memory allator for deep learning frameworks that reduces fragmentation by exploiting temporal regularity in memory allocation behaviors.<n>Built as a plug PyTorch, STWeaver reduces fragmentation ratio on average by 79.2% (up to 100%) across both dense and sparse models, with negligible overhead.
arXiv Detail & Related papers (2025-07-22T06:39:07Z)
Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage [9.106167012987747]
TERAIO is a framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs)<n>Its design is driven by our observation that the active tensors take only a small fraction (1.7% on average) of allocated GPU memory in each large language iteration training process.<n>We show that TERAIO improves the training performance of various LLMs by 1.47x on average, and achieves 80.7% of the ideal performance.
arXiv Detail & Related papers (2025-06-06T18:57:20Z)
Exploring the Benefit of Activation Sparsity in Pre-training [117.25661020250658]
We study how activation properties change during pre-training. We propose Switchable Sparse-Dense Learning (SSD) SSD achieves comparable performance with identical model size and reduces pre-training costs.
arXiv Detail & Related papers (2024-10-04T13:53:33Z)
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [10.115950753431528]
Large Language Models (LLMs) are a significant milestone in generative AI. The increasing context length and batch size in offline LLM inference escalates the memory requirement of the key-value (KV) cache. Several cost-effective solutions leverage host memory or optimized to reduce storage costs for offline inference scenarios. We propose InstInfer, which offloads the most performance-critical computation (i.e., attention in decoding phase) and data (i.e., KV cache) parts to Computational Storage Drives (CSDs) InstInfer improves throughput for long-sequence inference by
arXiv Detail & Related papers (2024-09-08T06:06:44Z)
Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. We propose MEMO, a novel framework for fine-grained activation memory management. We show that MEMO achieves an average of 2.42x and 2.26x MFU compared to Megatron-LM and DeepSpeed.
arXiv Detail & Related papers (2024-07-16T18:59:49Z)
Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation [29.139579820699495]
This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers.
arXiv Detail & Related papers (2024-06-24T03:09:15Z)
ProTrain: Efficient LLM Training via Memory-Aware Techniques [18.30799115938978]
This paper proposes ProTrain, a novel training system that balances memory usage and performance by coordinating memory, computation, and IO. ProTrain improves training throughput by 1.43$times$ to 2.71$times compared to the SOTA training systems.
arXiv Detail & Related papers (2024-06-12T15:40:06Z)
S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs [7.816840847892339]
Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference. We propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. Our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data.
arXiv Detail & Related papers (2024-05-30T17:54:35Z)
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states. In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy. Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
Contractive error feedback for gradient compression [60.05809370598166]
We propose a communication efficient method called contractive error feedback (ConEF) As opposed to SGD with error-feedback (EFSGD) that inefficiently manages memory, ConEF obtains the sweet spot of convergence and memory usage. We empirically validate ConEF on various learning tasks that include image classification, language modeling, and machine translation.
arXiv Detail & Related papers (2023-12-13T21:54:21Z)
Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
Selective Synaptic Dampening (SSD) is a fast, performant, and does not require long-term storage of the training data. We present a novel two-step, post hoc, retrain-free approach to machine unlearning which is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
MobileTL: On-device Transfer Learning with Inverted Residual Blocks [14.305834934988185]
We present MobileTL, a transfer learning method for models built with Inverted Residual Blocks (IRBs) MobileTL trains the shifts for internal normalization layers to avoid storing activation maps for the backward pass. Our method reduces memory usage by 46% and 53% for MobileNetV2 and V3 IRBs, respectively.
arXiv Detail & Related papers (2022-12-05T23:07:55Z)
Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction [3.5831119917067737]
We propose Tempo, a new approach to efficiently use accelerator memory resources for training Transformer-based models. Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage. We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline.
arXiv Detail & Related papers (2022-10-19T01:59:37Z)
DIVISION: Memory Efficient Training via Dual Activation Precision [60.153754740511864]
State-of-the-art work combines a search of quantization bit-width with the training, which makes the procedure complicated and less transparent. We propose a simple and effective method to compress DNN training. Experiment results show DIVISION has better comprehensive performance than state-of-the-art methods, including over 10x compression of activation maps and competitive training throughput, without loss of model accuracy.
arXiv Detail & Related papers (2022-08-05T03:15:28Z)
On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers. Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training. Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
Improving Computational Efficiency in Visual Reinforcement Learning via Stored Embeddings [89.63764845984076]
We present Stored Embeddings for Efficient Reinforcement Learning (SEER) SEER is a simple modification of existing off-policy deep reinforcement learning methods. We show that SEER does not degrade the performance of RLizable agents while significantly saving computation and memory.
arXiv Detail & Related papers (2021-03-04T08:14:10Z)
SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975]
Deep neural networks (DNNs) come with heavy parameterization, leading to external dynamic random-access memory (DRAM) for storage. We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation. We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
arXiv Detail & Related papers (2021-01-04T18:54:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.