When Can You Get Away with Low Memory Adam?
- URL: http://arxiv.org/abs/2503.01843v3
- Date: Mon, 17 Mar 2025 18:55:25 GMT
- Title: When Can You Get Away with Low Memory Adam?
- Authors: Dayal Singh Kalra, John Kirchenbauer, Maissam Barkeshli, Tom Goldstein
- Abstract summary: We show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.
- Score: 48.30892531847662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that sometimes match the performance of Adam, their lack of reliability has left Adam as the default choice. In this work, we apply a simple layer-wise Signal-to-Noise Ratio (SNR) analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions. Our SNR analysis reveals how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory, naturally leading to $\textit{SlimAdam}$, a memory-efficient Adam variant. $\textit{SlimAdam}$ compresses the second moments along dimensions with high SNR when feasible, and leaves them uncompressed when compression would be detrimental. Through experiments across a diverse set of architectures and training scenarios, we show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.
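For intuition, here is a minimal, hypothetical sketch of the kind of layer-wise SNR test the abstract describes: for a second-moment tensor, measure how consistent its values are along a candidate dimension and, if the SNR is high, store only the mean along that dimension. The function names and the threshold are illustrative assumptions, not taken from the SlimAdam codebase.

```python
import torch

def snr_along_dim(v: torch.Tensor, dim: int, eps: float = 1e-12) -> torch.Tensor:
    """Signal-to-noise ratio of a second-moment tensor along one dimension:
    mean / std across `dim`, averaged over the remaining entries."""
    mean = v.mean(dim=dim)
    std = v.std(dim=dim)
    return (mean / (std + eps)).mean()

def maybe_compress(v: torch.Tensor, dim: int, snr_threshold: float = 1.0) -> torch.Tensor:
    """If the SNR across `dim` is high, replace v by its mean along that dimension
    (kept for broadcasting); otherwise keep the full tensor.
    The threshold here is illustrative, not SlimAdam's actual rule."""
    if snr_along_dim(v, dim) > snr_threshold:
        return v.mean(dim=dim, keepdim=True)   # compressed second moment
    return v                                   # compression would hurt; keep full tensor

# Example: a (fan_out, fan_in) second-moment tensor from a linear layer.
v = torch.rand(512, 1024) ** 2
v_slim = maybe_compress(v, dim=1)
print(v.shape, "->", v_slim.shape)  # e.g. (512, 1024) -> (512, 1)
```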
Related papers
- Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity [6.270305440413688]
We find, and our experiments confirm, that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected.
arXiv Detail & Related papers (2024-10-10T17:58:53Z) - Adam-mini: Use Fewer Learning Rates To Gain More [29.170425801678952]
Adam-mini reduces memory by cutting down the learning rate resources in Adam. Adam-mini achieves on-par or better performance than AdamW with a 50% smaller memory footprint.
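As a rough illustration of this idea (the block choice and update details below are my assumptions, not Adam-mini's exact partitioning rule), the per-coordinate second moment can be replaced by a single shared value per parameter block:

```python
import torch

def blockwise_adam_step(p, g, m, v_block, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One illustrative update where the second moment is a single scalar shared
    by the whole parameter block, so only one adaptive learning rate is stored
    instead of one per coordinate. A sketch of the idea, not Adam-mini itself."""
    m.mul_(b1).add_(g, alpha=1 - b1)                      # per-coordinate first moment (unchanged)
    v_block = b2 * v_block + (1 - b2) * g.pow(2).mean()   # one scalar replaces g.numel() entries
    p.add_(m / (v_block.sqrt() + eps), alpha=-lr)
    return v_block
```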
arXiv Detail & Related papers (2024-06-24T16:56:41Z) - Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training [6.0904817096340125]
We propose a novel accumulation method for Adam, named Adam Accumulation (AdamA), which enables reducing both activation and gradient memory.
Specifically, AdamA directly integrates gradients into states and accumulates states over micro-batches, so that gradients can be released immediately after use.
AdamA achieves up to 23% memory reduction compared to gradient accumulation, with less than 2% degradation in training throughput.
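The key trick described above is to fold each micro-batch's gradient into the optimizer states right away so the gradient buffer can be freed before the next micro-batch. A rough sketch of that idea follows; the names and accumulation details are my own, not the AdamA implementation.

```python
import torch

def micro_batch_accumulate(p, m, v, b1=0.9, b2=0.999):
    """Call after loss.backward() on each micro-batch: fold the gradient into
    Adam's moment states and release the gradient buffer immediately, instead
    of keeping an accumulated gradient tensor around."""
    g = p.grad
    m.add_(g, alpha=1 - b1)                  # accumulate into first-moment state
    v.addcmul_(g, g, value=1 - b2)           # accumulate squared gradient into second-moment state
    p.grad = None                            # gradient memory can be freed right away

def optimizer_step(p, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Once per full step (after all micro-batches): apply the update, then decay
    the states so the next step's accumulation continues the moving averages."""
    p.add_(m / (v.sqrt() + eps), alpha=-lr)
    m.mul_(b1)
    v.mul_(b2)
```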
arXiv Detail & Related papers (2023-05-31T16:06:50Z) - Symbolic Discovery of Optimization Algorithms [132.62397077095787]
We use efficient search techniques to explore an infinite and sparse program space.
Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$.
Lion has been successfully deployed in production systems such as Google's search ads CTR model.
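For reference, Lion is commonly described as a sign-based rule with a single momentum buffer; the following is a minimal sketch based on that common description, not code quoted from the paper, and the hyperparameter values are only illustrative defaults.

```python
import torch

def lion_step(p, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """One Lion update: take the sign of an interpolation between the momentum
    and the current gradient, apply decoupled weight decay, then refresh the
    momentum with a second interpolation coefficient."""
    update = (b1 * m + (1 - b1) * g).sign()
    p.mul_(1 - lr * wd)                 # decoupled weight decay
    p.add_(update, alpha=-lr)           # sign-based parameter update
    m.mul_(b2).add_(g, alpha=1 - b2)    # momentum update uses b2, not b1
```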
arXiv Detail & Related papers (2023-02-13T20:27:30Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
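For context, analyses that drop the bounded-smoothness assumption typically replace it with a non-uniform smoothness condition in which the local smoothness may grow with the gradient norm. The statement below is my paraphrase of the standard $(L_0, L_1)$-smoothness condition, not a formula quoted from the paper.

```latex
% Bounded smoothness (standard assumption): \|\nabla^2 f(x)\| \le L
% Non-uniform (L_0, L_1)-smoothness: local smoothness may grow with the gradient norm
\|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \,\|\nabla f(x)\|
```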
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam [49.426602335460295]
1-bit communication is an effective method to scale up model training, and has been studied extensively on SGD.
We propose 0/1 Adam, which improves upon the state-of-the-art 1-bit Adam via two novel methods.
arXiv Detail & Related papers (2022-02-12T08:02:23Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
Adam is typically used in tandem with a squared $\ell_2$ regularizer, referred to as Adam-$\ell_2$.
AdamW decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$.
We show that the advantage AdamW exhibits over Adam-$\ell_2$ correlates with the degree to which we expect the gradients of the network to exhibit multiple scales.
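The decoupling mentioned above can be seen in a few lines: Adam-$\ell_2$ folds the regularizer's gradient into the gradient that Adam adapts, whereas AdamW applies the decay to the weights directly. This is a minimal sketch of that distinction, not the exact implementation from any library.

```python
import torch

def adam_l2_grad(p: torch.Tensor, g: torch.Tensor, wd: float) -> torch.Tensor:
    """Adam-l2: the regularizer's gradient wd * p is added to g, so it gets
    rescaled by Adam's adaptive 1/sqrt(v) like everything else."""
    return g + wd * p

def adamw_decay(p: torch.Tensor, lr: float, wd: float) -> None:
    """AdamW: weight decay is applied to the parameters directly, decoupled
    from the adaptive update, so it is not rescaled by 1/sqrt(v)."""
    p.mul_(1 - lr * wd)
```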
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z)