In Search of Adam's Secret Sauce
- URL: http://arxiv.org/abs/2505.21829v1
- Date: Tue, 27 May 2025 23:30:18 GMT
- Title: In Search of Adam's Secret Sauce
- Authors: Antonio Orvieto, Robert Gower
- Abstract summary: We train over 1,300 language models across different data configurations and scales. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam. We show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients.
- Score: 11.215133680044005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study - training over 1,300 language models across different data configurations and scales - comparing Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping settings, and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal. Beyond robust performance, this choice affords new theoretical insights, highlights the "secret sauce" on top of signed momentum, and grants a precise statistical interpretation: we show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients, one that arises from a mean-field Gaussian variational inference perspective.
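The equal-momentum constraint described in the abstract can be pictured with a minimal sketch (variable names are hypothetical; this is not the authors' code): with beta1 = beta2 = beta, Adam's two buffers become exponential moving averages of the gradient and the squared gradient, and after bias correction they yield online estimates of the gradient mean and variance.

```python
import numpy as np

def adam_equal_betas_step(param, grad, m, v, t, lr=1e-3, beta=0.95, eps=1e-8):
    """One Adam step with beta1 = beta2 = beta (a sketch, not the paper's code).

    m and v are EMAs of the gradient and squared gradient; after bias
    correction, m_hat estimates the gradient mean and v_hat - m_hat**2
    estimates its variance, matching the online mean/variance view.
    """
    m = beta * m + (1 - beta) * grad      # EMA of gradients (mean estimate)
    v = beta * v + (1 - beta) * grad**2   # EMA of squared gradients
    m_hat = m / (1 - beta**t)             # bias-corrected mean estimate
    v_hat = v / (1 - beta**t)             # bias-corrected second moment
    var_hat = v_hat - m_hat**2            # online variance estimate
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v, var_hat
```

With a constant gradient the variance estimate collapses to zero and the update direction matches the sign of the gradient, which is consistent with the signed-momentum connection drawn in the abstract.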
Related papers
- Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling [36.106114687828395]
Adam is known to perform significantly better than stochastic gradient descent (SGD) in language models. We exhaustively study how momentum, gradient clipping, and batch size affect the gap between SGD and Adam.
arXiv Detail & Related papers (2025-06-14T15:37:31Z) - Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps [65.64965527170156]
We adapt the widely used Adam optimiser for use in reinforcement learning. Our method, Adam-Rel, uses the local timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes. We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
arXiv Detail & Related papers (2024-12-22T18:01:08Z) - CAdam: Confidence-Based Optimization for Online Learning [35.84013976735154]
We introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates. Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known optimizers. In large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam.
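The per-dimension consistency check can be sketched like this (an illustrative reading of the summary, with hypothetical names; not the CAdam algorithm itself): update only the coordinates where the momentum and the fresh gradient agree in sign.

```python
import numpy as np

def sign_agreement_mask(momentum, grad):
    """Per-coordinate confidence mask in the spirit of CAdam (sketch):
    1.0 where momentum and gradient agree in sign, 0.0 where they conflict."""
    return (np.sign(momentum) == np.sign(grad)).astype(float)

def masked_update(param, momentum, grad, lr=1e-3):
    # Skip coordinates where the current gradient contradicts the momentum
    # direction, e.g. under distribution shift in online recommendation.
    return param - lr * sign_agreement_mask(momentum, grad) * momentum
```

Coordinates with conflicting signals are left untouched, which is one plausible way a confidence gate could stabilize online learning.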
arXiv Detail & Related papers (2024-11-29T12:00:27Z) - Deconstructing What Makes a Good Optimizer for Language Models [7.9224468703944115]
We compare several optimization algorithms, including SGD, Adafactor, Adam, Lion, and Sophia. No single algorithm emerges as a clear winner in terms of performance or stability under hyperparameter misspecification.
arXiv Detail & Related papers (2024-07-10T18:11:40Z) - Adam-family Methods with Decoupled Weight Decay in Deep Learning [3.4376560669160394]
We investigate the convergence properties of a wide class of Adam-family methods for training nonsmooth neural networks.
We propose a novel Adam-family method named Adam with Decoupled Weight Decay (AdamD) within our proposed framework.
arXiv Detail & Related papers (2023-10-13T04:59:44Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is simple and generic enough that it can be leveraged to establish convergence for a broader family of non-convex compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - How Do Adam and Training Strategies Help BNNs Optimization? [50.22482900678071]
We show that Adam is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability.
We derive a simple training scheme, building on existing Adam-based optimization, which achieves 70.5% top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-06-21T17:59:51Z) - Towards Practical Adam: Non-Convexity, Convergence Theory, and
Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have been used to help Adam-type algorithms converge.
We introduce an alternative easy-to-check sufficient condition, which merely depends on the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$+$ (pronounced as Adam-plus)
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of
Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
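One way to picture this principle (an illustrative sketch under simplifying assumptions, not the paper's algorithm; names are hypothetical): at each step, choose the averaging weight for the second-moment estimate from a small candidate set, per coordinate, so as to maximize the estimated variance v_new - m**2.

```python
import numpy as np

def variance_maximizing_v(m, v, grad, betas=(0.9, 0.99, 0.999)):
    """Pick, per coordinate, the beta whose resulting second-moment EMA
    maximizes the estimated variance v_new - m**2 (illustration only)."""
    # Candidate second-moment updates, one row per beta: shape (len(betas), d).
    candidates = np.stack([b * v + (1 - b) * grad**2 for b in betas])
    est_var = candidates - m**2                    # variance estimate per candidate
    best = np.argmax(est_var, axis=0)              # per-coordinate winning beta
    return candidates[best, np.arange(v.shape[0])]
```

When a fresh gradient is unusually large, a smaller beta (which weights it more heavily) wins, so the step size adapts quickly, which matches the faster-adaptation claim above.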
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.