Related papers: Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise

Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise

URL: http://arxiv.org/abs/2510.13680v1
Date: Wed, 15 Oct 2025 15:36:43 GMT
Title: Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise
Authors: Bingbin Liu, Rachit Bansal, Depen Morwani, Nikhil Vyas, David Alvarez-Melis, Sham M. Kakade,
Abstract summary: Two predominant approaches are based on Adam and Gauss-Newton (GN) methods.<n>We compare these two diagonal preconditioning methods through the lens of two key factors: the choice of basis in the preconditioner, and the impact of noise from mini-batching.
Score: 47.09204878445274
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diagonal preconditioners are computationally feasible approximate to second-order optimizers, which have shown significant promise in accelerating training of deep learning models. Two predominant approaches are based on Adam and Gauss-Newton (GN) methods: the former leverages statistics of current gradients and is the de-factor optimizers for neural networks, and the latter uses the diagonal elements of the Gauss-Newton matrix and underpins some of the recent diagonal optimizers such as Sophia. In this work, we compare these two diagonal preconditioning methods through the lens of two key factors: the choice of basis in the preconditioner, and the impact of gradient noise from mini-batching. To gain insights, we analyze these optimizers on quadratic objectives and logistic regression under all four quadrants. We show that regardless of the basis, there exist instances where Adam outperforms both GN$^{-1}$ and GN$^{-1/2}$ in full-batch settings. Conversely, in the stochastic regime, Adam behaves similarly to GN$^{-1/2}$ for linear regression under a Gaussian data assumption. These theoretical results are supported by empirical studies on both convex and non-convex objectives.

Related papers

Bayesian Circular Regression with von Mises Quasi-Processes [57.88921637944379]
In this work we explore a family of expressive and interpretable distributions over circle-valued random functions.<n>For posterior inference, we introduce a new Stratonovich-like augmentation that lends itself to fast Gibbs sampling.<n>We present experiments applying this model to the prediction of wind directions and the percentage of the running gait cycle as a function of joint angles.
arXiv Detail & Related papers (2024-06-19T01:57:21Z)
Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective. We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices. Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
Restricted Strong Convexity of Deep Learning Models with Smooth Activations [31.003601717265006]
We study the problem of optimization of deep learning models with smooth activation functions. We introduce a new analysis of optimization based on Restricted Strong Convexity (RSC) Ours is the first result on establishing geometric convergence of GD based on RSC for deep learning models.
arXiv Detail & Related papers (2022-09-29T21:24:26Z)
NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements. The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively. We show that NAG-arity is competitive with state-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z)
Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks. We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems. In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z)
Provably Efficient Convergence of Primal-Dual Actor-Critic with Nonlinear Function Approximation [15.319335698574932]
We show the first efficient convergence result with primal-dual actor-critic with a convergence of $mathcalOleft ascent(Nright)Nright)$ under Polyian sampling. Results on Open GymAI continuous control tasks.
arXiv Detail & Related papers (2022-02-28T15:16:23Z)
SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks. The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix. We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling? [59.820507600960745]
We propose a new GCP meta-layer that uses SVD in the forward pass, and Pad'e Approximants in the backward propagation to compute the gradients. The proposed meta-layer has been integrated into different CNN models and achieves state-of-the-art performances on both large-scale and fine-grained datasets.
arXiv Detail & Related papers (2021-05-06T08:03:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.