Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle
- URL: http://arxiv.org/abs/2601.21739v1
- Date: Thu, 29 Jan 2026 13:56:11 GMT
- Title: Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle
- Authors: Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí
- Abstract summary: Adam has been at the core of large-scale training for almost a decade. We show that Adam becomes gradient scale invariant of first order if and only if $β_1=β_2$.
- Score: 1.1145952934885128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $β_{1}=β_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as gradient scale invariance. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $β_{1}=β_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $β_{1}=β_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
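The claim in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's formal definition of gradient scale invariance; it is a minimal scalar example under stated assumptions: no bias correction, ε = 0, and a steady-state history in which all past gradients equal `g0`, so that `m = g0` and `v = g0**2`. The current gradient is then rescaled to `c * g0`, and the sensitivity of the Adam direction `m / sqrt(v)` to `c` is measured at `c = 1`. The function names `adam_direction` and `scale_sensitivity` are hypothetical helpers introduced for this illustration.

```python
import math


def adam_direction(c, beta1, beta2, g0=1.0):
    """Adam direction m / sqrt(v) after one step with gradient c * g0.

    Assumes a steady-state history (m = g0, v = g0**2), no bias
    correction and epsilon = 0. These are simplifying assumptions.
    """
    m = beta1 * g0 + (1.0 - beta1) * c * g0
    v = beta2 * g0**2 + (1.0 - beta2) * (c * g0) ** 2
    return m / math.sqrt(v)


def scale_sensitivity(beta1, beta2, h=1e-5):
    """Central finite-difference derivative of the direction w.r.t. c at c = 1."""
    up = adam_direction(1.0 + h, beta1, beta2)
    down = adam_direction(1.0 - h, beta1, beta2)
    return (up - down) / (2.0 * h)


if __name__ == "__main__":
    for beta1, beta2 in [(0.9, 0.999), (0.9, 0.95), (0.95, 0.95)]:
        print(beta1, beta2, scale_sensitivity(beta1, beta2))
```

In this toy setting the derivative works out analytically to $β_2 - β_1$, so the first-order effect of rescaling the current gradient vanishes exactly when $β_1 = β_2$, which is consistent with the statement in the abstract.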
Related papers
- Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime [26.492222550365735]
Adam is the de facto optimizer in deep learning, yet its theoretical understanding remains limited.
We study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data.
We construct a class of structured datasets where incremental Adam provably converges to the $\ell_\infty$-max-margin.
arXiv Detail & Related papers (2025-10-30T09:41:33Z)
- Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
We show that Adam achieves the optimal rate of $\mathcal{O}(\frac{1}{T^{1/4}})$ rather than the previous $\mathcal{O}(\frac{\ln T}{T^{1/4}})$.
Our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence.
arXiv Detail & Related papers (2025-07-08T13:19:26Z)
- Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z)
- Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
Adam is typically used together with a squared $\ell_2$ regularizer, a combination referred to as Adam-$\ell_2$.
AdamW decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$.
We show a correlation between the problems in which AdamW exhibits an advantage over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales.
arXiv Detail & Related papers (2022-01-31T21:00:55Z)
- AdamD: Improved bias-correction in Adam [0.0]
With the default bias-correction, Adam may actually make larger than requested gradient updates early in training.
The default implementation of Adam may be as sensitive as it is to the hyperparameters $\beta_1, \beta_2$ partially due to the originally proposed bias correction procedure and its behavior in early steps.
arXiv Detail & Related papers (2021-10-20T23:55:23Z)
- Investigating Alternatives to the Root Mean Square for Adaptive Gradient Methods [20.531576904743282]
Adam is an adaptive gradient method that has experienced widespread adoption due to its fast and reliable training performance.
Recent approaches have not offered significant improvement over Adam, often because they do not innovate upon one of its core features: normalization by the root mean square (RMS) of recent gradients.
We theoretically and empirically characterize the influence of different $L^p$ norms on adaptive gradient methods for the first time.
arXiv Detail & Related papers (2021-06-10T01:38:37Z)
- Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
- A new regret analysis for Adam-type algorithms [78.825194932103]
In theory, regret guarantees for online convex optimization require a rapidly decaying $\beta_1 \to 0$ schedule.
We propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant $\beta_1$.
arXiv Detail & Related papers (2020-03-21T19:19:51Z)
- A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We provide a simple proof of convergence covering both the Adam and Adagrad algorithms.
With the right hyperparameters, Adam converges at the same rate of $O(d\ln(N)/\sqrt{N})$.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.