Phase diagram of Stochastic Gradient Descent in high-dimensional
two-layer neural networks
- URL: http://arxiv.org/abs/2202.00293v4
- Date: Wed, 14 Jun 2023 14:15:08 GMT
- Title: Phase diagram of Stochastic Gradient Descent in high-dimensional
two-layer neural networks
- Authors: Rodrigo Veiga, Ludovic Stephan, Bruno Loureiro, Florent Krzakala,
Lenka Zdeborová
- Abstract summary: We investigate the connection between the mean-field/hydrodynamic regime and the seminal approach of Saad & Solla.
Our work builds on a deterministic description of SGD in high dimensions from statistical physics.
- Score: 22.823904789355495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the non-convex optimization landscape, over-parametrized shallow
networks are able to achieve global convergence under gradient descent. The
picture can be radically different for narrow networks, which tend to get stuck
in badly-generalizing local minima. Here we investigate the cross-over between
these two regimes in the high-dimensional setting, and in particular
investigate the connection between the so-called mean-field/hydrodynamic regime
and the seminal approach of Saad & Solla. Focusing on the case of Gaussian
data, we study the interplay between the learning rate, the time scale, and the
number of hidden units in the high-dimensional dynamics of stochastic gradient
descent (SGD). Our work builds on a deterministic description of SGD in
high-dimensions from statistical physics, which we extend and for which we
provide rigorous convergence rates.
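As a concrete illustration of the setting in the abstract, here is a minimal sketch (not the authors' code) of one-pass SGD for a soft-committee two-layer network on i.i.d. Gaussian inputs, with labels produced by a fixed teacher network, the setup going back to Saad & Solla. The dimension, widths, activation, and learning-rate scaling below are illustrative assumptions.

```python
# Minimal sketch: one-pass SGD for a soft-committee two-layer network on
# i.i.d. Gaussian inputs, with labels produced by a fixed teacher network.
# All sizes, the activation, and the learning-rate scaling are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 500, 4, 2            # input dimension, student width, teacher width
eta, steps = 0.5, 20_000       # learning rate and number of one-pass SGD steps

def g(z):                      # hidden-unit activation
    return np.tanh(z)

def gprime(z):                 # its derivative
    return 1.0 - np.tanh(z) ** 2

W_star = rng.normal(size=(m, d))   # fixed teacher first-layer weights
W = rng.normal(size=(k, d))        # student first-layer weights

def output(Wmat, x):
    """Soft-committee output: sum of activations of preactivations w_j . x / sqrt(d)."""
    pre = Wmat @ x / np.sqrt(d)
    return g(pre).sum(), pre

for t in range(steps):
    x = rng.normal(size=d)             # fresh Gaussian sample at every step (one pass)
    y_star, _ = output(W_star, x)      # noiseless teacher label
    y_hat, pre = output(W, x)
    err = y_hat - y_star
    # SGD step on the squared loss 0.5 * err**2 with respect to the student weights
    W -= eta * err * np.outer(gprime(pre), x) / np.sqrt(d)

# Student-teacher overlaps: the kind of low-dimensional order parameters whose
# deterministic high-dimensional dynamics this line of work tracks.
print("M = W W_star^T / d:\n", W @ W_star.T / d)
```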
Related papers
- Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
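A minimal sketch (assumed setup, not the paper's algorithm) of the three approximations mentioned above acting at once: finite particles, time discretization, and stochastic minibatch gradients, for a mean-field two-layer network in which each particle is one hidden neuron. All sizes and constants are illustrative.

```python
# Minimal sketch: time-discretized, finite-particle mean-field Langevin dynamics
# for a mean-field two-layer network. Each "particle" is one hidden neuron;
# minibatch gradients plus Gaussian noise give the noisy Langevin update.
import numpy as np

rng = np.random.default_rng(1)
n, d, N = 256, 5, 200                            # samples, input dim, particles (neurons)
eta, lam, steps, batch = 0.05, 1e-3, 2000, 32    # step size, temperature, iterations, minibatch

X = rng.normal(size=(n, d))
y = np.tanh(X @ rng.normal(size=d))              # synthetic regression targets

theta = rng.normal(size=(N, d))                  # particles = first-layer weights

def predict(theta, Xb):
    # mean-field output: average of the particles' activations
    return np.tanh(Xb @ theta.T).mean(axis=1)

for t in range(steps):
    idx = rng.integers(0, n, size=batch)         # stochastic-gradient minibatch
    Xb, yb = X[idx], y[idx]
    err = predict(theta, Xb) - yb                # (batch,)
    act = np.tanh(Xb @ theta.T)                  # (batch, N)
    # gradient of the squared loss w.r.t. each particle, averaged over the batch
    grad = ((err[:, None] * (1 - act ** 2)).T @ Xb) / (batch * N)
    # Langevin step: drift plus sqrt(2*lam*eta) Gaussian noise (entropy regularization)
    theta += -eta * grad + np.sqrt(2 * lam * eta) * rng.normal(size=theta.shape)

print("final mean-squared training error:", np.mean((predict(theta, X) - y) ** 2))
```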
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Leveraging the two timescale regime to demonstrate convergence of neural
networks [1.2328446298523066]
We study the training dynamics of neural networks in a two-timescale regime.
We show that gradient descent behaves according to the described limiting gradient flow in this regime, but can fail to converge outside of it.
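A minimal sketch of two-timescale SGD, under the assumption that "two-timescale" here means the output layer is updated with a much larger step size than the hidden layer; the step-size ratio and all sizes are illustrative placeholders.

```python
# Minimal sketch: two-timescale SGD for a two-layer network, with a fast
# output layer and a slow hidden layer. Constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 10, 8, 512
eta_fast = 0.5                      # output-layer (fast) step size
eps = 1e-2                          # timescale separation
eta_slow = eps * eta_fast           # hidden-layer (slow) step size

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                 # simple synthetic target

W = rng.normal(size=(k, d)) / np.sqrt(d)   # hidden (slow) layer
a = np.zeros(k)                            # output (fast) layer

for t in range(20_000):
    i = rng.integers(n)
    h = np.tanh(W @ X[i])
    err = a @ h - y[i]
    a -= eta_fast * err * h                                   # fast update
    W -= eta_slow * err * np.outer(a * (1 - h ** 2), X[i])    # slow update

print("training MSE:", np.mean((np.tanh(X @ W.T) @ a - y) ** 2))
```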
arXiv Detail & Related papers (2023-04-19T11:27:09Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective for solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
This paper proposes an implicit stochastic gradient descent (ISGD) method for training PINNs in order to improve the stability of the training process.
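As a rough illustration of the implicit update (not the paper's PINN implementation), the sketch below performs backward-Euler style steps theta_new = theta - eta * grad(theta_new) on a toy quadratic loss; the loss, dimension, and step size are arbitrary.

```python
# Minimal sketch: implicit (backward-Euler) gradient steps on a toy quadratic loss.
import numpy as np

rng = np.random.default_rng(3)
d = 20
G = rng.normal(size=(d, d))
A = G.T @ G / d + np.eye(d)            # SPD curvature of the toy quadratic loss
b = rng.normal(size=d)

def grad(theta):
    """Gradient of the toy quadratic loss 0.5 * theta^T A theta - b^T theta."""
    return A @ theta - b

def implicit_step(theta, eta):
    """One implicit step theta_new = theta - eta * grad(theta_new).
    For this quadratic loss the implicit equation is linear and solved exactly;
    for a general loss an inner solver (e.g. a few Newton iterations) is used."""
    return np.linalg.solve(np.eye(d) + eta * A, theta + eta * b)

theta = rng.normal(size=d)
for _ in range(100):
    # a step size at which plain explicit GD on this loss would typically diverge
    theta = implicit_step(theta, eta=1.0)

print("gradient norm after implicit steps:", np.linalg.norm(grad(theta)))
```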
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- From high-dimensional & mean-field dynamics to dimensionless ODEs: A
unifying approach to SGD in two-layers networks [26.65398696336828]
This manuscript investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network trained on Gaussian data and labels.
We rigorously analyse the limiting dynamics via a deterministic and low-dimensional description in terms of the sufficient statistics for the population risk.
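A short sketch of what the "sufficient statistics" are in this line of work, assuming the usual teacher-student committee-machine setup: the low-dimensional overlap matrices of the student and teacher first-layer weights. Sizes below are arbitrary.

```python
# Minimal sketch: the overlap matrices that act as sufficient statistics
# for the population risk with Gaussian inputs (illustrative sizes).
import numpy as np

rng = np.random.default_rng(4)
d, k, m = 1000, 3, 2                       # input dimension, student width, teacher width
W = rng.normal(size=(k, d))                # student first-layer weights
W_star = rng.normal(size=(m, d))           # teacher first-layer weights

Q = W @ W.T / d            # (k, k) student-student overlaps
M = W @ W_star.T / d       # (k, m) student-teacher overlaps
T = W_star @ W_star.T / d  # (m, m) teacher-teacher overlaps (fixed during training)

# For Gaussian inputs the student and teacher preactivations are jointly Gaussian
# with covariance assembled from Q, M and T, so the population risk depends on
# the weights only through these low-dimensional matrices.
print("Q =\n", Q)
print("M =\n", M)
print("T =\n", T)
```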
arXiv Detail & Related papers (2023-02-12T09:50:52Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
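A crude numerical probe of the rank-reduction effect described above (illustrative assumptions, not the paper's experiment): it measures a singular-value proxy for the effective rank of the first-layer weights of a two-layer leaky ReLU network before and after one full-batch gradient step from a small random initialization. All scales are arbitrary, so the size of the drop seen here is only indicative.

```python
# Crude probe: effective rank of the first-layer weights before and after
# one full-batch gradient step from a small random initialization.
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 50, 500, 100                 # samples, input dimension, hidden width
alpha, sigma0, eta = 0.1, 1e-4, 1.0    # leaky slope, init scale, step size

X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])                   # labels from a single direction (illustrative)

W = sigma0 * rng.normal(size=(k, d))          # small random first layer
a = rng.choice([-1.0, 1.0], size=k) / k       # fixed second layer

def leaky(z):
    return np.where(z > 0, z, alpha * z)

def dleaky(z):
    return np.where(z > 0, 1.0, alpha)

def effective_rank(M):
    """Simple proxy: sum of singular values divided by the largest one."""
    s = np.linalg.svd(M, compute_uv=False)
    return s.sum() / s.max()

# one full-batch gradient step on the squared loss
pre = X @ W.T                                  # (n, k) preactivations
err = leaky(pre) @ a - y                       # (n,) residuals
grad_W = ((err[:, None] * dleaky(pre)) * a).T @ X / n

print("effective rank before:", effective_rank(W))
print("effective rank after :", effective_rank(W - eta * grad_W))
```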
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- On the non-universality of deep learning: quantifying the cost of
symmetry [24.86176236641865]
We prove computational limitations for learning with neural networks trained by noisy gradient descent (GD).
We characterize functions that fully-connected networks can weak-learn on the binary hypercube and unit sphere.
Our techniques extend to stochastic gradient descent (SGD), for which we show nontrivial results for learning with fully-connected networks.
arXiv Detail & Related papers (2022-08-05T11:54:52Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
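A minimal sketch (illustrative weights, not a trained model) of the "knots" referred to above: for a univariate two-layer ReLU network, each hidden unit contributes a kink of the piecewise-linear output at x = -b_i / w_i, which can be compared against the training inputs.

```python
# Minimal sketch: knot locations of a univariate two-layer ReLU network
# f(x) = sum_i a_i * relu(w_i * x + b_i). Weights here are random placeholders.
import numpy as np

rng = np.random.default_rng(6)
k = 10
w, b, a = rng.normal(size=k), rng.normal(size=k), rng.normal(size=k)
x_train = np.linspace(-1.0, 1.0, 8)          # stand-in training inputs

def f(x):
    """Two-layer ReLU network output for an array of scalar inputs x."""
    return np.maximum(np.outer(x, w) + b, 0.0) @ a

knots = -b / w                               # unit i kinks where w_i * x + b_i = 0
inside = knots[(knots > x_train.min()) & (knots < x_train.max())]
print("knots inside the data range:", np.sort(inside))
print("training inputs           :", x_train)
```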
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
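A minimal sketch of Polyak-Ruppert iterate averaging, the "averaged stochastic gradient descent" referred to above, on a toy noisy least-squares problem; the model, noise level, and step size are arbitrary stand-ins for the NTK-regime setting of the paper.

```python
# Minimal sketch: SGD with a running (Polyak-Ruppert) average of the iterates.
import numpy as np

rng = np.random.default_rng(7)
d, steps, eta = 10, 5000, 0.05
w_star = rng.normal(size=d)

w = np.zeros(d)
w_bar = np.zeros(d)                      # running average of the iterates
for t in range(1, steps + 1):
    x = rng.normal(size=d)
    y = x @ w_star + 0.1 * rng.normal()  # noisy linear observation
    grad = (w @ x - y) * x               # stochastic gradient of the squared loss
    w -= eta * grad
    w_bar += (w - w_bar) / t             # online mean of w_1, ..., w_t

print("last-iterate error    :", np.linalg.norm(w - w_star))
print("averaged-iterate error:", np.linalg.norm(w_bar - w_star))
```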
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
- Dynamical mean-field theory for stochastic gradient descent in Gaussian
mixture classification [25.898873960635534]
We analyze in closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture.
We define a prototype process for these dynamics, which can be extended to a continuous-time limit.
In the full-batch limit, we recover the standard gradient flow.
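A minimal sketch (illustrative setup, not the paper's closed-form analysis) of the underlying algorithm: one-pass SGD for a single-layer network, here logistic regression, classifying a two-cluster high-dimensional Gaussian mixture. The dimension, cluster separation, and step size are arbitrary.

```python
# Minimal sketch: one-pass SGD for logistic regression on a two-cluster
# high-dimensional Gaussian mixture.
import numpy as np

rng = np.random.default_rng(8)
d, steps, eta = 400, 50_000, 0.5
mu = rng.normal(size=d) / np.sqrt(d)        # cluster mean direction, O(1) norm

w = np.zeros(d)
for t in range(steps):
    label = rng.choice([-1.0, 1.0])         # pick one of the two clusters
    x = label * mu + rng.normal(size=d)     # Gaussian sample centred on that cluster
    margin = label * (w @ x)
    # SGD on the logistic loss log(1 + exp(-margin)), step size eta / d
    w += (eta / d) * label * x / (1.0 + np.exp(margin))

cosine = (w @ mu) / (np.linalg.norm(w) * np.linalg.norm(mu))
print("alignment of the learned weights with the cluster mean:", cosine)
```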
arXiv Detail & Related papers (2020-06-10T22:49:41Z)
- Federated Stochastic Gradient Langevin Dynamics [12.180900849847252]
Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale posterior sampling.
We propose conducive gradients, a simple mechanism that combines local likelihood approximations to correct gradient updates.
We demonstrate that our approach can handle delayed communication rounds, converging to the target posterior in cases where distributed SGLD (DSGLD) fails.
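A minimal sketch of a plain SGLD update, the building block that distributed variants such as DSGLD and the proposed conducive-gradient correction modify; the Gaussian toy posterior and all constants are arbitrary, and no federated or distributed logic is shown.

```python
# Minimal sketch: stochastic gradient Langevin dynamics (SGLD) on a toy
# Gaussian linear-regression posterior with a standard normal prior.
import numpy as np

rng = np.random.default_rng(9)
n, d = 1000, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(size=n)       # Gaussian likelihood, unit noise

theta = np.zeros(d)
eta, batch, steps = 1e-4, 32, 20_000
samples = []
for t in range(steps):
    idx = rng.integers(0, n, size=batch)
    # unbiased minibatch estimate of the gradient of the negative log-posterior
    grad = theta + (n / batch) * X[idx].T @ (X[idx] @ theta - y[idx])
    # SGLD step: half-step drift plus N(0, eta) injected Gaussian noise
    theta += -0.5 * eta * grad + np.sqrt(eta) * rng.normal(size=d)
    if t > steps // 2:
        samples.append(theta.copy())          # keep the second half as posterior samples

print("posterior-mean estimate:", np.mean(samples, axis=0))
print("true parameter         :", theta_true)
```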
arXiv Detail & Related papers (2020-04-23T15:25:09Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.