Global Convergence of Second-order Dynamics in Two-layer Neural Networks
- URL: http://arxiv.org/abs/2007.06852v1
- Date: Tue, 14 Jul 2020 07:01:57 GMT
- Title: Global Convergence of Second-order Dynamics in Two-layer Neural Networks
- Authors: Walid Krichene, Kenneth F. Caluya, Abhishek Halder
- Abstract summary: Recent results have shown that for two-layer fully connected neural networks, gradient flow converges to a global optimum in the infinite width limit.
We show that a similar global convergence guarantee holds for second-order dynamics, namely the heavy ball method.
While our results are asymptotic in the mean field limit, numerical simulations indicate that global convergence may already occur for reasonably small networks.
- Score: 10.415177082023389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent results have shown that for two-layer fully connected neural networks,
gradient flow converges to a global optimum in the infinite width limit, by
making a connection between the mean field dynamics and the Wasserstein
gradient flow. These results were derived for first-order gradient flow, and a
natural question is whether second-order dynamics, i.e., dynamics with
momentum, exhibit a similar guarantee. We show that the answer is positive for
the heavy ball method. In this case, the resulting integro-PDE is a nonlinear
kinetic Fokker-Planck equation, and unlike the first-order case, it has no
apparent connection with the Wasserstein gradient flow. Instead, we study the
variations of a Lyapunov functional along the solution trajectories to
characterize the stationary points and to prove convergence. While our results
are asymptotic in the mean field limit, numerical simulations indicate that
global convergence may already occur for reasonably small networks.
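For concreteness, one common schematic way to write such second-order mean-field dynamics (an illustrative assumption; the paper's exact equations, constants, and noise model may differ) couples each neuron's parameters $x$ with a velocity $v$:
$$
\dot{x} = v, \qquad \dot{v} = -\gamma v - \nabla_x \frac{\delta F}{\delta \mu}(\mu_t)(x) + \sqrt{2\beta^{-1}}\,\xi_t,
$$
where $\mu_t$ is the distribution of neuron parameters, $F$ is the risk viewed as a functional of $\mu$, and $\xi_t$ is white noise. In the infinite-width limit, the joint law $\mu_t(x, v)$ then solves a kinetic Fokker-Planck equation of the generic form
$$
\partial_t \mu_t + v \cdot \nabla_x \mu_t = \nabla_v \cdot \Big( \mu_t \big( \gamma v + \nabla_x \tfrac{\delta F}{\delta \mu}(\mu_t) \big) \Big) + \beta^{-1} \Delta_v \mu_t,
$$
which is nonlinear because the drift depends on $\mu_t$ itself.
The following minimal NumPy sketch illustrates discrete heavy-ball (momentum) training of a small two-layer ReLU network under the mean-field $1/m$ scaling; it is a hypothetical illustration of the setting only, not the authors' experiments, and the data, hyperparameters, and helper names are made up.
```python
# Minimal sketch (assumed setup, not the authors' code): heavy-ball training of a
# two-layer ReLU network f(x) = (1/m) * sum_j a_j * relu(<w_j, x>) on toy data.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 200, 5, 50                          # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.maximum(X @ rng.normal(size=d), 0.0)   # teacher: a single ReLU neuron

W = rng.normal(size=(m, d))                   # hidden-layer weights (one row per neuron)
a = rng.normal(size=m)                        # output weights

def predict(W, a, X):
    return np.maximum(X @ W.T, 0.0) @ a / m

def particle_grads(W, a, X, y):
    """Per-neuron gradients scaled by m (the natural mean-field time scale)."""
    H = np.maximum(X @ W.T, 0.0)              # (n, m) hidden activations
    r = H @ a / m - y                         # residuals of the squared loss
    ga = H.T @ r / n                          # m * dL/da_j
    gW = ((X @ W.T > 0) * r[:, None] * a[None, :]).T @ X / n   # m * dL/dw_j
    return gW, ga

# Heavy ball: theta'' = -gamma * theta' - grad, discretized with step dt.
gamma, dt, steps = 1.0, 0.05, 3000
vW, va = np.zeros_like(W), np.zeros_like(a)
for t in range(steps):
    gW, ga = particle_grads(W, a, X, y)
    vW = (1.0 - gamma * dt) * vW - dt * gW
    va = (1.0 - gamma * dt) * va - dt * ga
    W += dt * vW
    a += dt * va
    if t % 500 == 0:
        print(f"step {t:4d}  loss {0.5 * np.mean((predict(W, a, X) - y) ** 2):.6f}")
```
In toy runs of this kind the training loss typically decreases steadily; the paper's contribution is to show that, in the mean field limit, such momentum dynamics converge to a global optimum rather than a spurious stationary point.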
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Proving Linear Mode Connectivity of Neural Networks via Optimal
Transport [27.794244660649085]
We provide a framework theoretically explaining this empirical observation.
We show how the support of the weight distribution of the neurons, which dictates the Wasserstein convergence rate, is correlated with mode connectivity.
arXiv Detail & Related papers (2023-10-29T18:35:05Z) - Approximation Results for Gradient Descent trained Neural Networks [0.0]
The networks are fully connected, of constant depth and increasing width.
An error bound in the continuous kernel norm implies an approximation result under the natural smoothness assumptions required for smooth functions.
arXiv Detail & Related papers (2023-09-09T18:47:55Z) - Leveraging the two timescale regime to demonstrate convergence of neural
networks [1.2328446298523066]
We study the training dynamics of neural networks in a two-timescale regime.
We show that gradient descent behaves according to our description of the gradient flow toward the optimum, but can fail outside this regime.
arXiv Detail & Related papers (2023-04-19T11:27:09Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Convex Analysis of the Mean Field Langevin Dynamics [49.66486092259375]
A convergence rate analysis of the mean field Langevin dynamics is presented.
The proximal Gibbs distribution $p_q$ associated with the dynamics allows us to develop a convergence theory parallel to classical results in convex optimization (a schematic form of these dynamics is sketched after this list).
arXiv Detail & Related papers (2022-01-25T17:13:56Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Non-asymptotic approximations of neural networks by Gaussian processes [7.56714041729893]
We study the extent to which wide neural networks may be approximated by Gaussian processes when initialized with random weights.
As the width of a network goes to infinity, its law converges to that of a Gaussian process.
arXiv Detail & Related papers (2021-02-17T10:19:26Z) - A Dynamical Central Limit Theorem for Shallow Neural Networks [48.66103132697071]
We prove that the fluctuations around the mean limit remain bounded in mean square throughout training.
If the mean-field dynamics converges to a measure that interpolates the training data, we prove that the deviation eventually vanishes in the CLT scaling.
arXiv Detail & Related papers (2020-08-21T18:00:50Z)
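As a point of comparison with the second-order dynamics above, the first-order mean field Langevin dynamics referenced in the "Convex Analysis of the Mean Field Langevin Dynamics" entry is commonly written (in a generic form whose constants and regularization may differ from that paper) as
$$
dX_t = -\nabla \frac{\delta F}{\delta \mu}(\mu_t)(X_t)\, dt + \sqrt{2\lambda}\, dB_t, \qquad \mu_t = \mathrm{Law}(X_t),
$$
and the distribution $p_q$ mentioned in the summary is typically a Gibbs-type distribution, $p_q(\theta) \propto \exp\!\big(-\tfrac{1}{\lambda}\,\tfrac{\delta F}{\delta \mu}(q)(\theta)\big)$, against which stationarity and convergence rates can be measured.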