Rethinking Gauss-Newton for learning over-parameterized models
- URL: http://arxiv.org/abs/2302.02904v3
- Date: Tue, 12 Dec 2023 08:40:56 GMT
- Title: Rethinking Gauss-Newton for learning over-parameterized models
- Authors: Michael Arbel and Romain Menegaux and Pierre Wolinski
- Abstract summary: We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning.
We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN's method.
- Score: 14.780386419851956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the global convergence and implicit bias of the Gauss-Newton
(GN) method when optimizing over-parameterized one-hidden-layer networks in the
mean-field regime. We first establish a global convergence result for GN in the
continuous-time limit exhibiting a faster convergence rate compared to GD due
to improved conditioning. We then perform an empirical study on a synthetic
regression task to investigate the implicit bias of GN's method. While GN is
consistently faster than GD in finding a global optimum, the learned model
generalizes well on test data when starting from random initial weights with a
small variance and using a small step size to slow down convergence.
Specifically, our study shows that such a setting results in a hidden learning
phenomenon, where the dynamics are able to recover features with good
generalization properties despite the model having sub-optimal training and
test performances due to an under-optimized linear layer. This study exhibits a
trade-off between the convergence speed of GN and the generalization ability of
the learned solution.
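A minimal sketch of the two update rules the abstract compares, applied to a small one-hidden-layer network with mean-field `1/m` scaling. It is not the authors' implementation: the abstract analyzes a continuous-time limit, whereas this uses discrete steps. The `tanh` activation, the damping term, the toy data, and all function names are assumptions made for illustration. GD moves along `J^T r`, while the damped GN direction `J^T (J J^T + damping * I)^{-1} r` rescales the residual by the inverse of the `n x n` kernel matrix, which is the source of the improved conditioning.

```python
import math
import random


def predict(params, xs):
    """Mean-field net (assumed form): f(x) = (1/m) * sum_i a_i * tanh(w_i * x)."""
    a, w = params
    m = len(a)
    return [sum(ai * math.tanh(wi * x) for ai, wi in zip(a, w)) / m for x in xs]


def jacobian(params, xs):
    """One row per data point; columns are d/da_1..d/da_m, then d/dw_1..d/dw_m."""
    a, w = params
    m = len(a)
    rows = []
    for x in xs:
        t = [math.tanh(wi * x) for wi in w]
        d_a = [ti / m for ti in t]
        d_w = [ai * x * (1.0 - ti * ti) / m for ai, ti in zip(a, t)]
        rows.append(d_a + d_w)
    return rows


def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * v[j] for j in range(i + 1, n))
        v[i] = (M[i][n] - s) / M[i][i]
    return v


def gd_direction(J, r):
    """Gradient of 0.5 * ||r||^2 with respect to the parameters: J^T r."""
    return [sum(J[k][j] * r[k] for k in range(len(r))) for j in range(len(J[0]))]


def gn_direction(J, r, damping=1e-6):
    """Damped Gauss-Newton direction: J^T (J J^T + damping * I)^{-1} r."""
    n = len(r)
    K = [[sum(x * y for x, y in zip(J[i], J[j])) + (damping if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return gd_direction(J, solve(K, r))


def train(direction, xs, ys, steps, lr, m=20, seed=0):
    """Run `steps` updates from a random Gaussian init; return the final loss."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(m)]
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for _ in range(steps):
        r = [f - y for f, y in zip(predict((a, w), xs), ys)]
        d = direction(jacobian((a, w), xs), r)
        a = [ai - lr * di for ai, di in zip(a, d[:m])]
        w = [wi - lr * di for wi, di in zip(w, d[m:])]
    r = [f - y for f, y in zip(predict((a, w), xs), ys)]
    return 0.5 * sum(ri * ri for ri in r)


if __name__ == "__main__":
    xs = [-1.0, 0.5, 2.0]            # toy inputs (hypothetical data)
    ys = [math.sin(x) for x in xs]   # toy regression targets
    print("GD loss:", train(gd_direction, xs, ys, steps=200, lr=0.1))
    print("GN loss:", train(gn_direction, xs, ys, steps=200, lr=0.1))
```

With 40 parameters and 3 data points the model is over-parameterized, so the kernel matrix `J J^T` is only `3 x 3` and cheap to invert. Shrinking `lr` or the initialization variance in `train` is one way to probe, on this toy problem, the speed versus generalization trade-off the abstract describes.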
Related papers
- On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community.
Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z)
- Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
Implicit gradient descent (IGD) outperforms the common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate.
arXiv Detail & Related papers (2024-07-03T06:10:41Z)
- Regularized Gauss-Newton for Optimizing Overparameterized Neural Networks [2.0072624123275533]
The generalized Gauss-Newton (GGN) optimization method incorporates curvature estimates into its solution steps.
This work studies a GGN method for optimizing a two-layer neural network with explicit regularization.
arXiv Detail & Related papers (2024-04-23T10:02:22Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Understanding Overparameterization in Generative Adversarial Networks [56.57403335510056]
Training Generative Adversarial Networks (GANs) involves non-convex concave min-max optimization problems.
Prior work in supervised learning has shown the importance of overparameterization for the convergence of gradient descent (GD) to globally optimal solutions.
We show that in an overparameterized GAN with a $1$-layer neural network generator and a linear discriminator, GDA converges to a global saddle point of the underlying non-convex concave min-max problem.
arXiv Detail & Related papers (2021-04-12T16:23:37Z)
- Convergence Analysis of Homotopy-SGD for non-convex optimization [43.71213126039448]
We present a first-order algorithm based on a combination of homotopy methods and SGD, called Homotopy-Stochastic Gradient Descent (H-SGD).
Under some assumptions, we conduct a theoretical analysis of the proposed algorithm.
Experimental results show that H-SGD can outperform SGD.
arXiv Detail & Related papers (2020-11-20T09:50:40Z)
- Fast Learning of Graph Neural Networks with Guaranteed Generalizability: One-hidden-layer Case [93.37576644429578]
Graph neural networks (GNNs) have made great progress recently on learning from graph-structured data in practice.
We provide a theoretically-grounded generalizability analysis of GNNs with one hidden layer for both regression and binary classification problems.
arXiv Detail & Related papers (2020-06-25T00:45:52Z)
- Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.