MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence
- URL: http://arxiv.org/abs/2405.15593v2
- Date: Tue, 05 Nov 2024 15:15:13 GMT
- Title: MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence
- Authors: Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtarik, Dan Alistarh
- Abstract summary: We propose a new variant of the Adam optimizer that specifically minimizes memory overhead, while maintaining theoretical convergence guarantees.
We control the resulting compression error via a novel instance of the classical \emph{error feedback} mechanism from distributed optimization.
We prove that the resulting approach maintains theoretical convergence guarantees competitive to those of AMSGrad, while providing good practical performance.
- Score: 35.17459630834073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new variant of the Adam optimizer called MicroAdam that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instance of the classical \emph{error feedback} mechanism from distributed optimization in which *the error correction information is itself compressed* to allow for practical memory gains. We prove that the resulting approach maintains theoretical convergence guarantees competitive to those of AMSGrad, while providing good practical performance. Specifically, we show that MicroAdam can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, MicroAdam provides practical convergence competitive to that of the uncompressed Adam baseline, with lower memory usage and similar running time. Our code is available at https://github.com/IST-DASLab/MicroAdam.
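As a rough illustration of the mechanism sketched in the abstract (compress the gradient before it reaches the optimizer state, and control the damage with an error-feedback buffer that is itself stored in compressed form), a minimal single-tensor sketch is given below. The names `topk_compress`, `quantize` and `MicroAdamSketch`, the Top-K and 4-bit choices, and the dense moment buffers are illustrative assumptions, not the authors' implementation (see the linked repository for that); in particular, a real memory saving also requires the optimizer state itself to be kept in compressed form, which this sketch does not do.

```python
import torch

def topk_compress(x, k):
    # Keep the k largest-magnitude entries of x, zero out the rest (illustrative compressor).
    flat = x.flatten()
    _, idx = torch.topk(flat.abs(), k)
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(x)

def quantize(x, bits=4):
    # Crude uniform quantizer, standing in for "the error buffer is itself compressed".
    scale = x.abs().max() / (2 ** (bits - 1) - 1)
    if scale == 0:
        return x
    return torch.round(x / scale) * scale

class MicroAdamSketch:
    """Toy single-tensor sketch of compress-then-error-feedback; NOT the real MicroAdam."""

    def __init__(self, param, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, density=0.01):
        self.p, self.lr, self.eps = param, lr, eps
        self.b1, self.b2 = betas
        self.k = max(1, int(density * param.numel()))
        self.m = torch.zeros_like(param)    # first moment (dense here only for simplicity)
        self.v = torch.zeros_like(param)    # second moment
        self.err = torch.zeros_like(param)  # error-feedback accumulator
        self.t = 0

    @torch.no_grad()
    def step(self, grad):
        self.t += 1
        acc = grad + self.err            # error feedback: re-inject previously dropped mass
        g = topk_compress(acc, self.k)   # compress the gradient before it enters the state
        self.err = quantize(acc - g)     # the residual (error) is itself stored compressed
        self.m.mul_(self.b1).add_(g, alpha=1 - self.b1)
        self.v.mul_(self.b2).addcmul_(g, g, value=1 - self.b2)
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        self.p.add_(m_hat / (v_hat.sqrt() + self.eps), alpha=-self.lr)
```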
Related papers
- COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs [81.01082659623552]
Large Language Models (LLMs) have demonstrated remarkable success across various domains.
Their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit.
arXiv Detail & Related papers (2025-02-24T18:42:19Z) - APOLLO: SGD-like Memory, AdamW-level Performance [61.53444035835778]
Large language models (LLMs) are notoriously memory-intensive during training.
Various memory-efficient optimizers have been proposed to reduce memory usage.
They face critical challenges: (i) costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial memory overhead to maintain competitive performance.
arXiv Detail & Related papers (2024-12-06T18:55:34Z) - Cautious Optimizers: Improving Training with One Line of Code [8.393403749426097]
We apply a one-line modification to momentum-based optimizers such as AdamW and rename the result Cautious, e.g. C-AdamW.
A whole new family of optimizers is revealed by our theoretical insight.
arXiv Detail & Related papers (2024-11-25T04:36:01Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
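The tile-based idea above can be sketched as follows, assuming an InfoNCE-style loss and, for brevity, tiling only along the key dimension: the log-sum-exp over negatives is accumulated block by block, so the full batch-by-batch similarity matrix is never materialized at once. This is an illustrative stand-in rather than the paper's implementation, and it ignores the custom backward pass that full memory savings under autograd would require.

```python
import torch

def tiled_infonce(q, k, tau=0.07, tile=1024):
    # q, k: (B, d) L2-normalized embeddings; positives sit on the diagonal of q @ k.T.
    B = q.shape[0]
    run_max = torch.full((B,), float("-inf"), device=q.device)  # running row-wise max
    run_sum = torch.zeros(B, device=q.device)                   # running sum of exp(s - max)
    pos = (q * k).sum(dim=1) / tau                               # positive-pair similarities
    for start in range(0, B, tile):
        s = (q @ k[start:start + tile].T) / tau   # only a B x tile block is in memory
        blk_max = s.max(dim=1).values
        new_max = torch.maximum(run_max, blk_max)
        # Rescale the running sum to the new max, then add this tile's contribution.
        run_sum = (run_sum * torch.exp(run_max - new_max)
                   + torch.exp(s - new_max[:, None]).sum(dim=1))
        run_max = new_max
    lse = run_max + run_sum.log()                 # log-sum-exp over all keys, per query
    return (lse - pos).mean()                     # InfoNCE: -log softmax of the positive
```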
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics [37.21593513802284]
We introduce LDAdam, a memory-efficient optimizer for training large models.
We show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
arXiv Detail & Related papers (2024-10-21T15:31:06Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - Adam Accumulation to Reduce Memory Footprints of both Activations and
Gradients for Large-scale DNN Training [6.0904817096340125]
We propose a novel accumulation method for Adam, named Adam Accumulation (AdamA), which enables reducing both activation and gradient memory.
Specifically, AdamA directly integrates gradients into states and accumulates states over micro-batches, so that gradients can be released immediately after use.
AdamA achieves up to 23% memory reduction compared to gradient accumulation, with less than 2% degradation in training throughput.
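A toy sketch of the accumulation idea described above, for a single parameter tensor: each micro-batch gradient is folded straight into the Adam moments and can then be freed, instead of being held in a separate gradient accumulator. The class name and details are illustrative, not the paper's implementation; note that accumulating per-micro-batch squares into the second moment approximates, rather than exactly reproduces, squaring the fully accumulated gradient.

```python
import torch

class AdamAccumulationSketch:
    """Toy sketch: fold micro-batch gradients into optimizer state (AdamA-style idea)."""

    def __init__(self, param, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.p, self.lr, self.eps = param, lr, eps
        self.b1, self.b2 = betas
        self.m = torch.zeros_like(param)
        self.v = torch.zeros_like(param)
        self.t = 0

    @torch.no_grad()
    def accumulate(self, grad, num_micro_batches):
        # Integrate this micro-batch's gradient into m and v right away; the gradient
        # buffer can then be released -- no separate accumulation buffer is kept.
        g = grad / num_micro_batches
        self.m.add_(g, alpha=1 - self.b1)          # contributes (1 - b1) * mean gradient
        self.v.addcmul_(g, g, value=1 - self.b2)   # per-micro-batch squares (approximation)

    @torch.no_grad()
    def step(self):
        # Called once per effective batch, after all micro-batches have been accumulated.
        self.t += 1
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        self.p.add_(m_hat / (v_hat.sqrt() + self.eps), alpha=-self.lr)
        self.m.mul_(self.b1)   # apply the exponential decay once per step,
        self.v.mul_(self.b2)   # ready for the next round of accumulation
```

In a training loop, `accumulate` would be called after each micro-batch backward pass (before releasing the gradient) and `step` once per effective batch.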
arXiv Detail & Related papers (2023-05-31T16:06:50Z) - Maximizing Communication Efficiency for Large-scale Training via 0/1
Adam [49.426602335460295]
1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD.
We propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel methods.
arXiv Detail & Related papers (2022-02-12T08:02:23Z) - 1-bit Adam: Communication Efficient Large-Scale Training with Adam's
Convergence Speed [39.23129626683372]
Communication has become a major bottleneck on commodity systems with standard TCP interconnects that offer limited network bandwidth.
One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression.
We propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam.
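The error-compensated compression mentioned above can be sketched generically: each tensor is reduced to its signs plus one scale, and the quantization residual is kept locally and added back before the next compression. This is an illustrative sign compressor, not the 1-bit Adam implementation itself.

```python
import torch

def one_bit_compress(update, err):
    """Error-compensated 1-bit (sign) compression of a tensor to be communicated.
    `err` is the residual buffer kept on the local worker between iterations."""
    acc = update + err                  # add back what was lost in the previous round
    scale = acc.abs().mean()            # a single scalar per tensor preserves magnitude
    compressed = scale * acc.sign()     # effectively 1 bit per entry, plus the scale
    new_err = acc - compressed          # residual stays local for the next round
    return compressed, new_err
```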
arXiv Detail & Related papers (2021-02-04T21:02:19Z) - APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD
Algorithm [39.110478306078974]
Adam is an important optimization algorithm for guaranteeing efficiency and accuracy when training on tasks such as BERT and ImageNet.
We propose a communication-efficient ADAM-preconditioned Momentum SGD algorithm, named APMSqueeze, based on an error-compensated method for compressing gradients.
arXiv Detail & Related papers (2020-08-26T02:20:23Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster, but only up to an error floor.
Rather than fixing the mini-batch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
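The trade-off described above can be made concrete with a standard bound for constant-step-size SGD on a $\mu$-strongly convex, $L$-smooth objective with per-sample gradient-noise variance $\sigma^2$ (a generic textbook form whose constants vary across references, not the bound derived in the paper):

$$\mathbb{E}\big[f(x_t) - f^\star\big] \;\le\; (1-\gamma\mu)^t\,\big(f(x_0) - f^\star\big) \;+\; \frac{\gamma L \sigma^2}{2\mu b},$$

where $\gamma$ is the step-size and $b$ the mini-batch size: a larger constant $\gamma$ makes the first term vanish quickly but raises the error floor in the second, while shrinking $\gamma$ or growing $b$ lowers the floor at the cost of speed, which is exactly the balance that adapting both parameters over the run targets.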
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.