Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
- URL: http://arxiv.org/abs/2511.02773v1
- Date: Tue, 04 Nov 2025 17:58:57 GMT
- Title: Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold
- Authors: Xinghan Li, Haodong Wen, Kaifeng Lyu
- Abstract summary: We show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from Gradient Descent. More specifically, when the loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner.
- Score: 14.185079197889806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, $\mathrm{tr}(\mathbf{H})$, whereas we prove that Adam minimizes $\mathrm{tr}(\mathrm{Diag}(\mathbf{H})^{1/2})$ instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.
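The contrast between $\mathrm{tr}(\mathbf{H})$ and $\mathrm{tr}(\mathrm{Diag}(\mathbf{H})^{1/2})$ is the core technical claim, and it is easy to see numerically. Below is a minimal sketch, not code from the paper, using two hand-picked toy Hessians: they have identical trace, so the implicit bias attributed to SGD cannot distinguish them, yet the measure attributed to Adam ranks them differently.

```python
# A toy numerical comparison (not from the paper) of the two sharpness measures in the
# abstract: tr(H), which label-noise SGD is known to implicitly minimize, and
# tr(Diag(H)^{1/2}), which this paper argues Adam minimizes instead.
import numpy as np

def sgd_sharpness(H):
    """tr(H): sum of the Hessian's diagonal entries (equivalently, of its eigenvalues)."""
    return np.trace(H)

def adam_sharpness(H):
    """tr(Diag(H)^{1/2}): sum of square roots of the Hessian's diagonal entries."""
    return np.sum(np.sqrt(np.clip(np.diag(H), 0.0, None)))

# Two hypothetical minimizers with equal tr(H) but different curvature profiles.
H_balanced = np.diag([2.0, 2.0])   # curvature spread evenly across coordinates
H_spiky    = np.diag([3.9, 0.1])   # same trace, curvature concentrated in one coordinate

for name, H in [("balanced", H_balanced), ("spiky", H_spiky)]:
    print(f"{name:8s}  tr(H) = {sgd_sharpness(H):.2f}   "
          f"tr(Diag(H)^1/2) = {adam_sharpness(H):.2f}")
# Both Hessians give tr(H) = 4.0, but tr(Diag(H)^{1/2}) is ~2.29 for the spiky one
# versus ~2.83 for the balanced one, so the two measures can prefer different minimizers.
```

Because the square root is concave, concentrating curvature in fewer coordinates lowers $\mathrm{tr}(\mathrm{Diag}(\mathbf{H})^{1/2})$ at fixed $\mathrm{tr}(\mathbf{H})$, which is consistent with the abstract's sparsity claim for diagonal linear networks.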
Related papers
- In-Run Data Shapley for Adam Optimizer [13.904612598915165]
We propose Adam-Aware In-Run Data Shapley, which restores additivity by redefining utility under a fixed-state assumption. Our method achieves near-perfect fidelity (in Pearson correlation) to ground-truth marginal contributions while retaining $\sim$95% of standard training.
arXiv Detail & Related papers (2026-01-30T21:31:40Z) - Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks [38.11287525994738]
We present the first theoretical characterization of how mini-batch stochasticity affects Adam's generalization. Our results reveal that while both Adam and AdamW with proper weight decay converge to solutions with poor test error, their mini-batch variants can achieve near-zero test error.
arXiv Detail & Related papers (2025-10-13T12:48:22Z) - A Simplified Analysis of SGD for Linear Regression with Weight Averaging [64.2393952273612]
Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression using a constant learning rate. We provide a simplified analysis recovering the same bias and variance bounds provided in Zou et al. (2021), based on simple linear algebra tools. We believe our work makes the analysis of gradient descent on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling.
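As a rough illustration of the setting this entry analyzes (my own toy sketch, with made-up dimension, step size, and noise level rather than anything from the paper): with a constant learning rate the last SGD iterate keeps fluctuating at a noise floor, while averaging the tail of the trajectory typically lands much closer to the true weights.

```python
# A toy sketch (not from the cited paper) of constant-step-size SGD on linear regression
# with tail averaging of the iterates. Dimension, step size, and noise level are made up.
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, lr, noise = 10, 5000, 0.05, 0.1
w_star = rng.normal(size=d)              # ground-truth regression weights

w, tail = np.zeros(d), []
for t in range(n_steps):
    x = rng.normal(size=d)               # fresh sample each step (streaming setting)
    y = x @ w_star + noise * rng.normal()
    w -= lr * (x @ w - y) * x            # SGD step on 0.5 * (x @ w - y)**2
    if t >= n_steps // 2:                # average only the second half of the run
        tail.append(w.copy())

w_avg = np.mean(tail, axis=0)
print("last-iterate error:", np.linalg.norm(w - w_star))
print("tail-average error:", np.linalg.norm(w_avg - w_star))
```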
arXiv Detail & Related papers (2025-06-18T15:10:38Z) - The Rich and the Simple: On the Implicit Bias of Adam and SGD [26.722625797251553]
Adam is the de facto optimization algorithm for several deep learning applications. In practice, neural networks (NNs) trained with (stochastic) gradient descent (GD) are known to exhibit simplicity bias. We show that Adam is more resistant to such simplicity bias.
arXiv Detail & Related papers (2025-05-29T21:46:12Z) - AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training [22.58304858379219]
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of the weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance.
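Going only by the denominator described in this summary, an AdamS-style update might look like the sketch below; the hyperparameter names (`beta1`, `alpha`) and values are illustrative assumptions, not the paper's. The point is simply that only a momentum buffer is carried, with no second-moment estimate.

```python
# A sketch of an AdamS-style step inferred from the summary above: the denominator is
# the square root of a weighted sum of squares of the momentum and the current gradient,
# so no second-moment buffer is stored. Hyperparameter names/values are assumptions.
import numpy as np

def adams_step(w, grad, m, lr=1e-3, beta1=0.9, alpha=0.9, eps=1e-8):
    """One update; `m` is the only optimizer state carried across steps."""
    m = beta1 * m + (1 - beta1) * grad                      # standard momentum
    denom = np.sqrt(alpha * m**2 + (1 - alpha) * grad**2)   # momentum + current gradient
    return w - lr * m / (denom + eps), m

# Toy usage on f(w) = 0.5 * ||w||^2, where grad = w.
w, m = np.ones(3), np.zeros(3)
for _ in range(200):
    w, m = adams_step(w, grad=w, m=m)
print("w after 200 steps:", w)
```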
arXiv Detail & Related papers (2025-05-22T08:16:48Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with a diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - Understanding AdamW through Proximal Methods and Scale-Freeness [57.47324825501137]
To improve generalization, Adam is often used with a squared $\ell_2$ regularizer (Adam-$\ell_2$).
AdamW decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$.
We show a correlation between the advantage of AdamW over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales.
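The decoupling discussed above is easiest to see side by side. Below is a minimal sketch (simplified: no bias correction, toy hyperparameters) of the two update rules: Adam-$\ell_2$ folds the $\ell_2$ penalty into the gradient before the adaptive rescaling, whereas AdamW applies weight decay directly to the parameters, outside that rescaling.

```python
# A simplified side-by-side of the two update rules (no bias correction; toy hyperparameters).
import numpy as np

def adam_l2_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    g = grad + wd * w                        # l2 penalty enters the adaptive statistics
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    return w - lr * m / (np.sqrt(v) + eps), m, v

def adamw_step(w, grad, m, v, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # statistics see only the raw gradient
    v = b2 * v + (1 - b2) * grad**2
    return w - lr * (m / (np.sqrt(v) + eps) + wd * w), m, v   # decay applied outside

# One step from the same state already shows the two parameter updates differ.
w0, g0 = np.array([1.0, -2.0]), np.array([0.1, 0.3])
print(adam_l2_step(w0, g0, np.zeros(2), np.zeros(2))[0])
print(adamw_step(w0, g0, np.zeros(2), np.zeros(2))[0])
```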
arXiv Detail & Related papers (2022-01-31T21:00:55Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish the convergence for solving a broader family of non-convex compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have been used to make Adam-type algorithms converge.
We introduce an alternative, easy-to-check sufficient condition, which depends only on the parameters of the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior; a rough sketch of the idea follows this entry.
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
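The sketch below is my own interpretation of the MaxVA summary above, not the paper's implementation: candidate decay values and the two-moment variance estimate are illustrative assumptions. Each coordinate keeps whichever averaging weight currently yields the largest estimated gradient variance, and the resulting second moment would take the place of Adam's fixed-decay $v_t$.

```python
# A sketch of a MaxVA-style second-moment update inferred from the summary above: each
# coordinate picks, from a candidate set, the averaging weight that maximizes its
# estimated gradient variance. Candidate decays and the variance estimate are assumptions.
import numpy as np

def maxva_second_moment(mu, v, grad, betas=(0.5, 0.9, 0.98)):
    """Update running first/second moments with the variance-maximizing decay per coordinate."""
    best_mu, best_v = mu.copy(), v.copy()
    best_var = np.full_like(mu, -np.inf)
    for b in betas:
        mu_b = b * mu + (1 - b) * grad
        v_b = b * v + (1 - b) * grad**2
        var_b = v_b - mu_b**2                # per-coordinate variance estimate
        take = var_b > best_var              # keep whichever decay gives maximal variance
        best_mu = np.where(take, mu_b, best_mu)
        best_v = np.where(take, v_b, best_v)
        best_var = np.maximum(best_var, var_b)
    return best_mu, best_v                   # best_v would replace Adam's fixed-decay v_t

# Toy usage on two gradient samples.
mu, v = np.zeros(3), np.zeros(3)
for g in (np.array([1.0, 0.1, -0.5]), np.array([0.8, -0.2, 0.6])):
    mu, v = maxva_second_moment(mu, v, g)
print(mu, v)
```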
This list is automatically generated from the titles and abstracts of the papers on this site.