Related papers: Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective

Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective

URL: http://arxiv.org/abs/2412.14031v3
Date: Fri, 20 Dec 2024 15:58:45 GMT
Title: Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective
Authors: Semih Cayci,
Abstract summary: We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions.<n>We show that the Levenberg-Marquardt dynamics with an appropriately chosen damping factor yields robustness to ill-conditioned kernels.
Score: 3.48097307252416
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We analyze the convergence of Gauss-Newton dynamics for training neural networks with smooth activation functions. In the underparameterized regime, the Gauss-Newton gradient flow induces a Riemannian gradient flow on a low-dimensional, smooth, embedded submanifold of the Euclidean output space. Using tools from Riemannian optimization, we prove \emph{last-iterate} convergence of the Riemannian gradient flow to the optimal in-class predictor at an \emph{exponential rate} that is independent of the conditioning of the Gram matrix, \emph{without} requiring explicit regularization. We further characterize the critical impacts of the neural network scaling factor and the initialization on the convergence behavior. In the overparameterized regime, we show that the Levenberg-Marquardt dynamics with an appropriately chosen damping factor yields robustness to ill-conditioned kernels, analogous to the underparameterized regime. These findings demonstrate the potential of Gauss-Newton methods for efficiently optimizing neural networks, particularly in ill-conditioned problems where kernel and Gram matrices have small singular values.

Related papers

Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration [67.12978375116599]
We show that the steps of gradient descent (GD) reduce to those of generalized perceptron algorithms.<n>This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks.
arXiv Detail & Related papers (2025-12-12T14:16:35Z)
Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models [5.076419064097734]
We analyze a matrix Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling.<n>Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton along the optimization path yields tighter stability bounds.
arXiv Detail & Related papers (2025-11-06T01:50:45Z)
Convergence of Stochastic Gradient Methods for Wide Two-Layer Physics-Informed Neural Networks [0.6319731355340598]
In practice, one often employs gradient descent type algorithms to train the neural network.<n>In this work, we establish the linear convergence of gradient descent / flow in training over-ized two layer PINNs for a general class of activation functions in the sense of high probability.
arXiv Detail & Related papers (2025-08-29T12:25:51Z)
Non-asymptotic convergence analysis of the stochastic gradient Hamiltonian Monte Carlo algorithm with discontinuous stochastic gradient with applications to training of ReLU neural networks [7.170021086161168]
We provide a non-asymptotic analysis of the convergence of the gradient Hamiltonian Monte Carlo to a target measure in Wasserstein-1 and Wasserstein-2 distance.<n>To illustrate our main results, we consider numerical experiments on quantile estimation and on several problems involving ReLU neural networks relevant in finance and artificial intelligence.
arXiv Detail & Related papers (2024-09-25T17:21:09Z)
On the Convergence Analysis of Over-Parameterized Variational Autoencoders: A Neural Tangent Kernel Perspective [7.580900499231056]
Variational Auto-Encoders (VAEs) have emerged as powerful probabilistic models for generative tasks. This paper provides a mathematical proof of VAE under mild assumptions. We also establish a novel connection between the optimization problem faced by over-Eized SNNs and the Kernel Ridge (KRR) problem.
arXiv Detail & Related papers (2024-09-09T06:10:31Z)
A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks. We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks. Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
A Structure-Guided Gauss-Newton Method for Shallow ReLU Neural Network [18.06366638807982]
We propose a structure-guided Gauss-Newton (SgGN) method for solving least squares problems using a shallow ReLU neural network. The method effectively takes advantage of both the least squares structure and the neural network structure of the objective function.
arXiv Detail & Related papers (2024-04-07T20:24:44Z)
Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective. We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices. Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
On Learning Gaussian Multi-index Models with Gradient Flow [57.170617397894404]
We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection.
arXiv Detail & Related papers (2023-10-30T17:55:28Z)
Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks. We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z)
Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
convergence rate analysis of the mean field Langevin dynamics is presented. $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization.
arXiv Detail & Related papers (2022-01-25T17:13:56Z)
Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks [0.0]
We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that the network learns eigenfunctions of an integral operator $T_Kinfty$ determined by the Neural Tangent Kernel (NTK) We conclude that damped deviations offers a simple and unifying perspective of the dynamics when optimizing the squared error.
arXiv Detail & Related papers (2022-01-12T23:28:41Z)
The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a emphdouble descent curve. We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z)
Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach [144.21892195917758]
We study estimation in a class of generalized Structural equation models (SEMs) We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using a gradient descent. For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting.
arXiv Detail & Related papers (2020-07-02T17:55:47Z)
Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate. We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
Stable Neural Flows [15.318500611972441]
We introduce a provably stable variant of neural ordinary differential equations (neural ODEs) whose trajectories evolve on an energy functional parametrised by a neural network. The learning procedure is cast as an optimal control problem, and an approximate solution is proposed based on adjoint sensivity analysis.
arXiv Detail & Related papers (2020-03-18T06:27:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.