LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
- URL: http://arxiv.org/abs/2410.16103v3
- Date: Thu, 07 Nov 2024 14:00:45 GMT
- Title: LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics
- Authors: Thomas Robert, Mher Safaryan, Ionut-Vlad Modoranu, Dan Alistarh
- Abstract summary: We introduce LDAdam, a memory-efficient optimizer for training large models.
We show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
- Score: 37.21593513802284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
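The abstract names three ingredients: adaptive steps taken inside a low-dimensional subspace, a projection-aware rule for carrying optimizer states across subspace changes, and a generalized error feedback term for the compression error. The sketch below illustrates these ideas on a single 2-D parameter; the rank, the SVD-based choice of subspace, and the second-moment transport rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def ldadam_step(W, grad, state, rank=4, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One illustrative low-rank adaptive step for a 2-D parameter W."""
    # Generalized error feedback: re-inject the compression error from the last step.
    g = grad + state.get("error", 0.0)

    # Pick a fresh rank-r subspace from the corrected gradient (top left singular vectors).
    U, _, _ = np.linalg.svd(g, full_matrices=False)
    P = U[:, :rank]

    if "P" in state:
        # Projection-aware transition: re-express the old moments in the new subspace.
        R = P.T @ state["P"]
        state["m"] = R @ state["m"]
        state["v"] = (R ** 2) @ state["v"]   # crude transport of the second moment (assumption)
    else:
        state["m"] = np.zeros((rank, W.shape[1]))
        state["v"] = np.zeros((rank, W.shape[1]))

    g_low = P.T @ g                          # rank x n gradient statistics instead of m x n
    state["error"] = g - P @ g_low           # what the projection discarded, fed back next step
    state["P"] = P

    # Adam-style moments maintained only in the low-dimensional subspace.
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g_low
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g_low ** 2
    return W - lr * (P @ (state["m"] / (np.sqrt(state["v"]) + eps))), state

W, state = np.random.randn(64, 32), {}
for _ in range(3):
    W, state = ldadam_step(W, np.random.randn(*W.shape), state)
```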
Related papers
- SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training [16.037614012166063]
Stochastic Gradient Descent (SGD) is stateless and scalable, as it does not track state variables during training.
In this work, we show that pre-processing the stochastic gradients in a stateless manner can achieve the same performance as Adam for training LLMs.
We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), an approach that eliminates the need to store any optimizer states.
arXiv Detail & Related papers (2024-12-17T18:13:18Z)
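As a rough illustration of SWAN's "normalize, then whiten, keep no state" idea, the snippet below rescales gradient rows and applies an inverse-square-root whitening of the Gram matrix; the normalization axis and the whitening solver are assumptions, not the paper's exact recipe.

```python
import numpy as np

def swan_like_update(W, grad, lr=1e-2, eps=1e-8):
    # Normalization: rescale each gradient row to unit norm.
    g = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)
    # Whitening: left-multiply by (G G^T)^{-1/2}, computed here via an eigendecomposition.
    vals, vecs = np.linalg.eigh(g @ g.T)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T
    # No running moments are stored, so the optimizer itself is stateless.
    return W - lr * (inv_sqrt @ g)

W = swan_like_update(np.random.randn(16, 8), np.random.randn(16, 8))
```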
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster convergence compared to standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
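A minimal sketch of the underlying idea, estimating a directional derivative from two forward passes along a random low-rank perturbation, is shown below; the perturbation structure, scaling, and update rule are assumptions rather than SubZero's exact procedure.

```python
import numpy as np

def zo_subspace_grad(loss_fn, W, rank=4, mu=1e-3, rng=np.random.default_rng(0)):
    # Random low-rank direction instead of a full-size Gaussian perturbation.
    U = rng.standard_normal((W.shape[0], rank))
    V = rng.standard_normal((W.shape[1], rank))
    Z = (U @ V.T) / np.sqrt(rank)
    # Two forward passes give a finite-difference directional derivative;
    # no backpropagation (and hence no activation memory) is needed.
    d = (loss_fn(W + mu * Z) - loss_fn(W - mu * Z)) / (2 * mu)
    return d * Z

loss = lambda M: float(np.sum(M ** 2))   # toy quadratic objective
W = np.random.randn(8, 4)
W -= 0.1 * zo_subspace_grad(loss, W)
```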
- Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach [11.878350833222711]
We propose a method called GradSamp for sampling gradient updates from a Gaussian distribution.
GradSamp not only streamlines gradient computation but also enables skipping entire epochs, thereby enhancing overall efficiency.
We rigorously validate our hypothesis across a diverse set of standard and non-standard CNN and transformer-based models.
arXiv Detail & Related papers (2024-06-11T15:01:20Z)
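The summary suggests replacing some gradient computations with samples drawn from a Gaussian fitted to recently observed gradients. The sketch below is one hedged reading of that idea; the fitting window, skip schedule, and per-element statistics are assumptions, not GradSamp's actual design.

```python
import numpy as np

def maybe_sampled_step(W, grad_fn, history, lr=1e-2, skip=False, rng=np.random.default_rng(0)):
    if skip and len(history) >= 2:
        # Skipped step: draw the update from a Gaussian fitted to recent gradients,
        # avoiding the forward/backward pass entirely.
        stack = np.stack(history)
        g = rng.normal(stack.mean(axis=0), stack.std(axis=0) + 1e-8)
    else:
        g = grad_fn(W)           # regular step: compute and remember the true gradient
        history.append(g)
    return W - lr * g

W, history = np.random.randn(4, 4), []
for step in range(6):
    W = maybe_sampled_step(W, lambda M: 2 * M, history, skip=(step % 3 == 2))
```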
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
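One common way to keep per-parameter adaptive learning rates while storing far less than full Adam state is to factor the second-moment accumulator into row and column statistics; the sketch below illustrates that general idea, and whether it matches AdaLomo's actual update rule is an assumption.

```python
import numpy as np

def factored_adaptive_step(W, grad, state, lr=1e-3, beta2=0.999, eps=1e-8):
    sq = grad ** 2 + eps
    # Keep only a row vector and a column vector of second-moment statistics.
    state["row"] = beta2 * state.get("row", sq.mean(axis=1)) + (1 - beta2) * sq.mean(axis=1)
    state["col"] = beta2 * state.get("col", sq.mean(axis=0)) + (1 - beta2) * sq.mean(axis=0)
    # Rank-1 reconstruction of the per-element accumulator from the two vectors.
    v = np.outer(state["row"], state["col"]) / state["row"].mean()
    return W - lr * grad / (np.sqrt(v) + eps), state

W, state = np.random.randn(8, 6), {}
for _ in range(3):
    W, state = factored_adaptive_step(W, np.random.randn(*W.shape), state)
```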
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
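To make the column-row sampling idea concrete, the snippet below approximates a matrix product by keeping the highest-scoring column-row pairs exactly and importance-sampling the rest; the scoring rule and budget split are illustrative assumptions, and the paper's exact WTA-CRS estimator may differ.

```python
import numpy as np

def wta_crs_matmul(A, B, budget=8, keep=4, rng=np.random.default_rng(0)):
    # Score each column-row pair by the product of its norms.
    probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs /= probs.sum()
    winners = np.argsort(probs)[-keep:]                  # kept deterministically ("winner-take-all")
    rest = np.setdiff1d(np.arange(A.shape[1]), winners)
    p_rest = probs[rest] / probs[rest].sum()

    out = A[:, winners] @ B[winners, :]                  # exact contribution of the winners
    idx = rng.choice(len(rest), size=budget - keep, replace=True, p=p_rest)
    for i in idx:                                        # unbiased, importance-weighted remainder
        k = rest[i]
        out += np.outer(A[:, k], B[k, :]) / (p_rest[i] * (budget - keep))
    return out

A, B = np.random.randn(6, 32), np.random.randn(32, 5)
print(np.linalg.norm(wta_crs_matmul(A, B) - A @ B))      # approximation error
```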
- A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
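A brief sketch of the remedy's core geometric idea, removing the component of the update parallel to a scale-invariant weight so that its norm does not keep growing, is shown below; the criterion for deciding which weights are scale-invariant is omitted here, and the exact AdamP projection may differ.

```python
import numpy as np

def project_out_radial(w, update, eps=1e-12):
    # Remove the component of the update parallel to w; for a scale-invariant
    # weight this component only inflates ||w|| and shrinks the effective step.
    w_hat = w / (np.linalg.norm(w) + eps)
    return update - np.dot(w_hat.ravel(), update.ravel()) * w_hat

w = np.random.randn(10)
step = 0.1 * np.random.randn(10)
w = w - project_out_radial(w, step)
```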
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of existing variations can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.