Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
- URL: http://arxiv.org/abs/2603.03099v1
- Date: Tue, 03 Mar 2026 15:34:51 GMT
- Title: Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails
- Authors: Ruinan Jin, Yingbin Liang, Shaofeng Zou
- Abstract summary: We uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that distinguishes Adam from SGD. In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods.
- Score: 66.18297682243694
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded-variance model (a second-moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $\delta^{-1/2}$ dependence on the confidence parameter $\delta$, whereas the corresponding high-probability guarantee for SGD necessarily incurs at least a $\delta^{-1}$ dependence.
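To make the mechanism concrete, below is a minimal sketch of the standard SGD and Adam updates, with the second-moment normalization the abstract refers to called out in comments. It shows only the textbook update rules with illustrative default hyperparameters, not the paper's stopping-time/martingale analysis.

```python
import numpy as np

def sgd_step(w, grad, lr=1e-2):
    """Plain SGD: the raw stochastic gradient enters the update directly,
    so a heavy gradient sample moves the iterate by its full magnitude."""
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. The second-moment estimate v normalizes each coordinate,
    so an unusually large gradient sample is damped before it reaches w."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum) EMA
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment EMA (the normalizer)
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Illustrative usage on a toy gradient with one large coordinate.
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, np.array([10.0, 0.1, -5.0]), m, v, t=1)
```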
Related papers
- Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization [62.48819955422706]
We study the long-term tail decay of SGD-based methods through the lens of large deviations theory. We uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
arXiv Detail & Related papers (2026-02-05T13:41:13Z) - Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
We show that Adam achieves the optimal rate of $\mathcal{O}(\frac{1}{T^{1/4}})$ rather than the previous $\mathcal{O}(\frac{\ln T}{T^{1/4}})$. Our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence.
arXiv Detail & Related papers (2025-07-08T13:19:26Z) - Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity [6.270305440413688]
We find that Adam performs much worse when the favorable $\ell_\infty$-geometry is altered, while SGD provably remains unaffected; our experiments confirm this separation.
arXiv Detail & Related papers (2024-10-10T17:58:53Z) - On the Convergence of Adam under Non-uniform Smoothness: Separability from SGDM and Beyond [35.65852208995095]
We demonstrate that Adam achieves faster convergence than SGDM under the condition of non-uniformly bounded smoothness.
Our findings reveal that: (1) in deterministic environments, Adam can attain the known lower bound for the convergence rate of deterministic first-order optimizers, whereas the convergence rate of Gradient Descent with Momentum (GDM) has higher-order dependence on the initial function value.
arXiv Detail & Related papers (2024-03-22T11:57:51Z) - High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise [59.25598762373543]
We establish high-probability convergence guarantees of nonlinear SGD for learning on streaming data in the presence of heavy-tailed noise.
We demonstrate analytically and empirically that the parameter $\eta$ can be used to determine the preferred choice of setting for a given problem.
arXiv Detail & Related papers (2023-10-28T18:53:41Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - AdaSGD: Bridging the gap between SGD and Adam [14.886598905466604]
We identify potential differences in performance between SGD and Adam.
We demonstrate how AdaSGD combines the benefits of both SGD and Adam.
arXiv Detail & Related papers (2020-06-30T05:44:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.