AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training
- URL: http://arxiv.org/abs/2505.16363v1
- Date: Thu, 22 May 2025 08:16:48 GMT
- Title: AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training
- Authors: Huishuai Zhang, Bohan Wang, Luoxin Chen,
- Abstract summary: We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of the weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance.
- Score: 22.58304858379219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_0, L_1)$ smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pre-training runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
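Based on the abstract's description, the AdamS update keeps only a momentum buffer and normalizes it by the root of a weighted sum of squares of the momentum and the current gradient. The sketch below is an illustrative single-parameter reading of that rule, not the paper's reference implementation; the function name and the reuse of AdamW's `beta2` as the weighting coefficient are assumptions.

```python
import math

def adams_step(theta, grad, m, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamS-style update for a single scalar parameter (sketch)."""
    # Decoupled weight decay, inherited from AdamW.
    theta *= (1.0 - lr * weight_decay)
    # Momentum (first moment), exactly as in Adam.
    m = beta1 * m + (1.0 - beta1) * grad
    # Denominator: root of the weighted sum of squares of the momentum
    # and the current gradient -- no second-moment buffer is stored.
    denom = math.sqrt(beta2 * m * m + (1.0 - beta2) * grad * grad) + eps
    theta -= lr * m / denom
    return theta, m
```

Because the only persistent state is `m`, the memory footprint matches SGD with momentum, while the denominator still adapts the step size per coordinate.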
Related papers
- In Search of Adam's Secret Sauce [11.215133680044005]
We train over 1,300 language models across different data configurations and scales. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam. We show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients.
arXiv Detail & Related papers (2025-05-27T23:30:18Z) - When Can You Get Away with Low Memory Adam? [48.30892531847662]
We show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.
arXiv Detail & Related papers (2025-03-03T18:59:40Z) - Towards Simple and Provable Parameter-Free Adaptive Gradient Methods [56.060918447252625]
We present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates.
arXiv Detail & Related papers (2024-12-27T04:22:02Z) - CAdam: Confidence-Based Optimization for Online Learning [35.84013976735154]
We introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates. Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known optimizers. In large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam.
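One plausible reading of the per-dimension consistency check described above is a sign-agreement gate between the momentum and the fresh gradient. The sketch below is a hypothetical illustration of that idea only; the function name and the exact gating rule are assumptions, not CAdam's published algorithm.

```python
def sign_agreement_mask(m, grad):
    """Per-coordinate confidence gate (hypothetical sketch, not CAdam's
    exact rule): mark a coordinate as confident (1.0) only when the
    momentum and the current gradient agree in sign."""
    return [1.0 if mi * gi > 0 else 0.0 for mi, gi in zip(m, grad)]
```

A confidence-gated optimizer could then scale or skip the update in coordinates where the mask is zero, damping updates driven by momentum that the latest gradient contradicts.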
arXiv Detail & Related papers (2024-11-29T12:00:27Z) - Deconstructing What Makes a Good Optimizer for Language Models [7.9224468703944115]
We compare several optimization algorithms, including SGD, Adafactor, Adam, Lion, and Sophia. No single algorithm emerged as a clear winner in terms of performance or stability under hyperparameter misspecification.
arXiv Detail & Related papers (2024-07-10T18:11:40Z) - An Isometric Stochastic Optimizer [0.0]
Adam is the standard choice in deep learning applications.
I propose a simple explanation of Adam's success: it makes each parameter's step size independent of the norms of the other parameters.
I derive Iso, a new approach which makes the norm of a parameter's update invariant to the application of any linear transformation to its inputs and outputs.
arXiv Detail & Related papers (2023-07-24T17:56:58Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
AdamW is often contrasted with Adam-$\ell_2$, i.e., Adam with an added $\ell_2$ regularizer.
AdamW decouples the weight-decay term from the gradient-based update rule of Adam-$\ell_2$.
We show that AdamW exhibits an advantage over Adam-$\ell_2$ that grows with the degree to which we expect the gradients of the network to exhibit multiple scales.
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that introducing momentum on top of normalization results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.