Adam Accumulation to Reduce Memory Footprints of both Activations and
Gradients for Large-scale DNN Training
- URL: http://arxiv.org/abs/2305.19982v1
- Date: Wed, 31 May 2023 16:06:50 GMT
- Title: Adam Accumulation to Reduce Memory Footprints of both Activations and
Gradients for Large-scale DNN Training
- Authors: Yijia Zhang, Yibo Han, Shijie Cao, Guohao Dai, Youshan Miao, Ting Cao,
Fan Yang, Ningyi Xu
- Abstract summary: We propose a novel accumulation method for Adam, named Adam Accumulation (AdamA), which enables reducing both activation and gradient memory.
Specifically, AdamA directly integrates gradients into states and accumulates states over micro-batches, so that gradients can be released immediately after use.
AdamA achieves up to 23% memory reduction compared to gradient accumulation with less than 2% degradation in training throughput.
- Score: 6.0904817096340125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Running out of GPU memory has become a main bottleneck for large-scale DNN
training. How to reduce the memory footprint during training has received
intensive research attention. We find that previous gradient accumulation
reduces activation memory but fails to be compatible with gradient memory
reduction due to a contradiction between preserving gradients and releasing
gradients. To address this issue, we propose a novel optimizer accumulation
method for Adam, named Adam Accumulation (AdamA), which enables reducing both
activation and gradient memory. Specifically, AdamA directly integrates
gradients into optimizer states and accumulates optimizer states over
micro-batches, so that gradients can be released immediately after use. We
mathematically and experimentally demonstrate AdamA yields the same convergence
properties as Adam. Evaluated on transformer-based models, AdamA achieves up to
23% memory reduction compared to gradient accumulation with less than 2%
degradation in training throughput. Notably, AdamA can work together with
memory reduction methods for optimizer states to fit 1.26x~3.14x larger models
over PyTorch and DeepSpeed baseline on GPUs with different memory capacities.
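The mechanism described in the abstract can be illustrated with a short sketch. The following is a minimal PyTorch-style sketch of the general idea of optimizer accumulation, not the authors' implementation: each micro-batch gradient is folded into the Adam moments and released immediately, so no separate gradient-accumulation buffer persists across micro-batches. The exact way AdamA accumulates the second moment is simplified here and should be read as an assumption, as are the function and state names.

```python
import torch


def adam_accumulation_step(params, state, micro_batches, loss_fn,
                           lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One optimizer step over several micro-batches: each micro-batch gradient
    is folded into the Adam moments and then freed, instead of being kept in a
    separate gradient-accumulation buffer."""
    n = len(micro_batches)
    for i, batch in enumerate(micro_batches):
        loss_fn(batch).backward()                      # produces p.grad for this micro-batch
        for p in params:
            st = state.setdefault(p, {"m": torch.zeros_like(p),
                                      "v": torch.zeros_like(p)})
            g = p.grad / n                             # this micro-batch's share of the step
            if i == 0:                                 # decay the moments once per step
                st["m"].mul_(betas[0])
                st["v"].mul_(betas[1])
            st["m"].add_(g, alpha=1 - betas[0])        # accumulate into the first moment
            # second moment from per-micro-batch squares -- a simplification, not AdamA's exact rule
            st["v"].add_(g * g * n, alpha=1 - betas[1])
            p.grad = None                              # release gradient memory immediately
    with torch.no_grad():
        for p in params:
            st = state[p]
            p.add_(-lr * st["m"] / (st["v"].sqrt() + eps))  # bias correction omitted for brevity
```

In contrast, standard gradient accumulation keeps a full-size gradient buffer alive across all micro-batches and only then runs the Adam update, which is the contradiction between preserving and releasing gradients that the abstract refers to.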
Related papers
- When Can You Get Away with Low Memory Adam? [48.30892531847662]
We show that SlimAdam matches Adam's performance and stability while saving up to 98% of total second moments.
Code for SlimAdam is available at https://github.com/dayal-kalra/low-memory-adam.
arXiv Detail & Related papers (2025-03-03T18:59:40Z)
- APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training.
Various memory-efficient optimizers have been proposed to reduce memory usage.
They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z)
- COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection [17.54863041098623]
We present COAP, a memory-efficient method that minimizes computational overhead while maintaining training performance.
For LLaMA-1B, it reduces memory by 61% with only 2% additional time cost, achieving the same PPL as AdamW.
With 8-bit quantization, COAP cuts memory by 81% and achieves a 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
arXiv Detail & Related papers (2024-11-26T03:50:52Z)
- Cut Your Losses in Large-Vocabulary Language Models [102.6981011879656]
We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory.
CCE reduces the memory footprint of the loss from 24 GB to 1 MB, and the total training-time memory consumption of the head from 28 GB to 1 GB.
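A minimal sketch of the flavor of this idea, assuming a PyTorch setting: the loss is computed in token chunks with recomputation in the backward pass, so the full [num_tokens, vocab_size] logit matrix never has to exist at once. CCE itself goes further with a fused kernel that only materializes the correct-token logits and an on-the-fly log-sum-exp; the chunking, checkpointing, and names below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


def chunked_cross_entropy(hidden, weight, targets, chunk_size=1024):
    """hidden: [T, d] final hidden states, weight: [V, d] unembedding matrix,
    targets: [T] token ids. Returns the mean cross-entropy over T tokens."""
    total, count = hidden.new_zeros(()), 0
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]
        t = targets[start:start + chunk_size]

        def chunk_loss(h_, t_):
            logits = h_ @ weight.T                  # [chunk, V], never the full [T, V]
            return F.cross_entropy(logits, t_, reduction="sum")

        # checkpointing recomputes the chunk's logits in backward instead of storing them
        total = total + checkpoint(chunk_loss, h, t, use_reentrant=False)
        count += h.shape[0]
    return total / count
```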
arXiv Detail & Related papers (2024-11-13T20:30:15Z)
- Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients [24.58231358634904]
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory.
We propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates.
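A minimal sketch of the general idea under assumptions: a structured sparse projection keeps only k rows of each gradient matrix, the optimizer states live in that small projected space, and the update is scattered back as a structured sparse matrix. How Grass actually selects, rescales, and periodically refreshes the rows is not reproduced here.

```python
import torch


def sparse_projected_adam_update(grad, state, lr=1e-3, k=32,
                                 betas=(0.9, 0.999), eps=1e-8):
    """grad: [m, n] weight gradient. Keeps optimizer states only for k selected
    rows and returns a structured sparse [m, n] update."""
    if "rows" not in state:
        # pick k rows by gradient row norm -- a simple stand-in selection rule
        state["rows"] = grad.norm(dim=1).topk(k).indices
    g = grad[state["rows"]]                           # [k, n] projected gradient
    if "m" not in state:
        state["m"], state["v"] = torch.zeros_like(g), torch.zeros_like(g)
    state["m"].mul_(betas[0]).add_(g, alpha=1 - betas[0])
    state["v"].mul_(betas[1]).add_(g * g, alpha=1 - betas[1])
    update = torch.zeros_like(grad)                   # nonzero only on the selected rows
    update[state["rows"]] = -lr * state["m"] / (state["v"].sqrt() + eps)
    return update
```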
arXiv Detail & Related papers (2024-06-25T15:50:32Z)
- MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence [35.17459630834073]
We propose a new variant of the Adam optimizer that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees.
We control the resulting compression error via a novel instance of the classical error feedback mechanism from distributed optimization.
We prove that the resulting approach maintains theoretical convergence guarantees competitive to those of AMSGrad, while providing good practical performance.
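The error feedback mechanism mentioned above can be sketched generically. The following is a textbook top-k error-feedback loop with assumed names; MicroAdam's actual compressed optimizer states and quantized error buffer are not reproduced here.

```python
import torch


def topk_compress(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    flat = x.flatten()
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(x)


def error_feedback_step(grad, error, k):
    """Compress (grad + carried error); carry forward what the compressor dropped."""
    corrected = grad + error              # re-inject previously dropped gradient mass
    compressed = topk_compress(corrected, k)
    error.copy_(corrected - compressed)   # remember the new compression error
    return compressed                     # this is what the optimizer sees
```

The caller keeps one error buffer per parameter (initialized to zeros) and feeds the compressed result to the optimizer in place of the raw gradient.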
arXiv Detail & Related papers (2024-05-24T14:25:23Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
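Column-row sampling for approximating a matrix product is the family of estimators this summary refers to. Below is a minimal sketch of the classical version; the sampling distribution is an assumption, and the winner-take-all variance-reduction step that distinguishes WTA-CRS is not shown.

```python
import torch


def crs_matmul(A, B, k):
    """Unbiased estimate of A @ B using k sampled column-row pairs.
    A: [m, n], B: [n, p]."""
    # sample pair i with probability proportional to ||A[:, i]|| * ||B[i, :]||
    probs = A.norm(dim=0) * B.norm(dim=1)
    probs = probs / probs.sum()
    idx = torch.multinomial(probs, k, replacement=True)
    scale = 1.0 / (k * probs[idx])                     # importance weights
    return (A[:, idx] * scale) @ B[idx, :]             # [m, k] @ [k, p]
```

Weighting each sampled pair by 1/(k p_i) makes the estimator unbiased: its expectation equals A @ B.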
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce the memory footprint by about half during training.
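A minimal sketch of the general idea for a simple linear layer: the forward pass uses the exact input, but only a half-precision copy of it is saved for the backward pass. Mesa's actual scheme (e.g. 8-bit activation quantization with learned statistics) is more involved; this is not the authors' implementation.

```python
import torch


class LowPrecisionLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        out = x @ weight.t()                    # exact forward computation
        ctx.save_for_backward(x.half(), weight) # store the activation in fp16 only
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_lp, weight = ctx.saved_tensors
        x = x_lp.to(grad_out.dtype)             # dequantize for gradient computation
        grad_x = grad_out @ weight              # gradient w.r.t. the input
        grad_w = grad_out.t() @ x               # gradient w.r.t. the weight
        return grad_x, grad_w
```

A module would call LowPrecisionLinear.apply(x, weight) in place of x @ weight.t(); the saved activation costs half the memory, at the price of a small quantization error in the weight gradient.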
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first able to apply differential privacy on the BERT model and achieve an average accuracy of 83.9% on four downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:14:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.