Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models
- URL: http://arxiv.org/abs/2511.03972v2
- Date: Thu, 13 Nov 2025 01:27:41 GMT
- Title: Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models
- Authors: Semih Cayci,
- Abstract summary: We analyze a matrix Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling.<n>Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton along the optimization path yields tighter stability bounds.
- Score: 5.076419064097734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time convergence bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using uniform stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields tighter stability bounds.
Related papers
- Towards a Unified Analysis of Neural Networks in Nonparametric Instrumental Variable Regression: Optimization and Generalization [66.08522228989634]
We establish the first global convergence result of neural networks for two stage least squares (2SLS) approach in nonparametric instrumental variable regression (NPIV)<n>This is achieved by adopting a lifted perspective through mean-field Langevin dynamics (MFLD)
arXiv Detail & Related papers (2025-11-18T17:51:17Z) - A Riemannian Optimization Perspective of the Gauss-Newton Method for Feedforward Neural Networks [3.48097307252416]
We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions.<n>We show that the Levenberg-Marquardt dynamics with an appropriately chosen damping schedule yields fast convergence rate despite potentially ill-conditioned neural tangent kernel matrices.
arXiv Detail & Related papers (2024-12-18T16:51:47Z) - Exact Gauss-Newton Optimization for Training Deep Neural Networks [5.249805590164902]
We present Exact Gauss-Newton (EGN), a second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction.<n>We show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm.<n>Our numerical experiments demonstrate that EGN consistently exceeds, or at most matches the performance of well-tuned SGD, Adam, SQN, and SGN generalizations across various supervised and reinforcement learning tasks.
arXiv Detail & Related papers (2024-05-23T10:21:05Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective.
We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-arity is competitive with state-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Single Trajectory Nonparametric Learning of Nonlinear Dynamics [8.438421942654292]
Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE)
We leverage recently developed information-theoretic methods to establish the optimality of the LSE for non hypotheses classes.
We specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS)
arXiv Detail & Related papers (2022-02-16T19:38:54Z) - Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector
Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the emphStochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z) - Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models [0.2741266294612776]
We analyze a class of gradient algorithms with momentum on a high-dimensional random least squares problem.
We show that (small-batch) momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly.
In the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum.
arXiv Detail & Related papers (2021-06-07T15:08:24Z) - Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to
Improve Generalization [89.7882166459412]
gradient noise (SGN) acts as implicit regularization for deep learning.
Some works attempted to artificially simulate SGN by injecting random noise to improve deep learning.
For simulating SGN at low computational costs and without changing the learning rate or batch size, we propose the Positive-Negative Momentum (PNM) approach.
arXiv Detail & Related papers (2021-03-31T16:08:06Z) - Bayesian Sparse learning with preconditioned stochastic gradient MCMC
and its applications [5.660384137948734]
The proposed algorithm converges to the correct distribution with a controllable bias under mild conditions.
We show that the proposed algorithm canally converge to the correct distribution with a controllable bias under mild conditions.
arXiv Detail & Related papers (2020-06-29T20:57:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.