Quantitative Propagation of Chaos for SGD in Wide Neural Networks
- URL: http://arxiv.org/abs/2007.06352v2
- Date: Tue, 14 Jul 2020 06:19:18 GMT
- Title: Quantitative Propagation of Chaos for SGD in Wide Neural Networks
- Authors: Valentin De Bortoli, Alain Durmus, Xavier Fontaine, Umut Simsekli
- Abstract summary: In this paper, we investigate the limiting behavior of a continuous-time counterpart of the Stochastic Gradient Descent (SGD) algorithm applied to two-layer overparameterized neural networks.
We show 'propagation of chaos' for the particle system defined by this continuous-time dynamics under different scenarios.
We identify two regimes under which different mean-field limits are obtained, one of them corresponding to an implicitly regularized version of the minimization problem at hand.
- Score: 39.35545193410871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the limiting behavior of a continuous-time
counterpart of the Stochastic Gradient Descent (SGD) algorithm applied to
two-layer overparameterized neural networks, as the number of neurons (i.e., the
size of the hidden layer) $N \to +\infty$. Following a probabilistic approach,
we show 'propagation of chaos' for the particle system defined by this
continuous-time dynamics under different scenarios, indicating that the
statistical interaction between the particles asymptotically vanishes. In
particular, we establish quantitative convergence with respect to $N$ of any
particle to a solution of a mean-field McKean-Vlasov equation in the metric
space endowed with the Wasserstein distance. In comparison to previous works on
the subject, we consider settings in which the sequence of stepsizes in SGD can
potentially depend on the number of neurons and the iterations. We then
identify two regimes under which different mean-field limits are obtained, one
of them corresponding to an implicitly regularized version of the minimization
problem at hand. We perform various experiments on real datasets to validate
our theoretical results, assessing the existence of these two regimes on
classification problems and illustrating our convergence results.
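For intuition, the following is a minimal sketch (not the authors' code) of the finite-$N$ particle system the abstract refers to: a two-layer network of width $N$ trained with online SGD, where each hidden unit's parameters act as one particle and the output carries the $1/N$ mean-field scaling. The teacher function, activation, step-size schedule, and the rescaling of the learning rate by $N$ are illustrative assumptions; the mean-field McKean-Vlasov limit describes the evolution of the empirical law of these particles as $N \to +\infty$.
```python
import numpy as np

# Sketch: two-layer network f(x) = (1/N) * sum_i a_i * tanh(<w_i, x>)
# trained with online SGD on synthetic data. Each (a_i, w_i) is one "particle";
# the mean-field limit describes the empirical law of these particles as N -> oo.
rng = np.random.default_rng(0)

N, d = 512, 10                      # number of neurons (particles), input dimension
a = rng.normal(size=N)              # output weights
W = rng.normal(size=(N, d))         # input weights, one row per particle

def target(x):                      # illustrative teacher function
    return np.sin(x[:, 0]) + 0.5 * x[:, 1]

def forward(x):
    h = np.tanh(x @ W.T)            # shape (batch, N)
    return h @ a / N                # mean-field (1/N) scaling

eta0 = 0.5                          # base step size; the paper lets steps depend on N and k
for k in range(20_000):
    x = rng.normal(size=(1, d))     # one fresh sample per step (online SGD)
    y = target(x)
    h = np.tanh(x @ W.T)[0]
    err = (forward(x) - y).item()   # scalar residual
    # gradients of the squared loss w.r.t. each particle's parameters
    grad_a = err * h / N
    grad_W = (err * a * (1 - h ** 2) / N)[:, None] * x
    eta = eta0 / (1 + k / 5_000)    # a decaying schedule, purely illustrative
    a -= eta * N * grad_a           # learning rates are often rescaled by N in the
    W -= eta * N * grad_W           # mean-field parametrization; conventions vary

x_test = rng.normal(size=(256, d))
print("empirical risk on a fresh test batch:",
      np.mean((forward(x_test) - target(x_test)) ** 2))
```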
Related papers
- Dimension-independent learning rates for high-dimensional classification
problems [53.622581586464634]
We show that every $RBV^2$ function can be approximated by a neural network with bounded weights.
We then prove the existence of a neural network with bounded weights approximating a classification function.
arXiv Detail & Related papers (2024-09-26T16:02:13Z) - Non-asymptotic convergence analysis of the stochastic gradient
Hamiltonian Monte Carlo algorithm with discontinuous stochastic gradient with
applications to training of ReLU neural networks [8.058385158111207]
We provide a non-asymptotic analysis of the convergence of the stochastic gradient Hamiltonian Monte Carlo algorithm to a target measure in Wasserstein-1 and Wasserstein-2 distance.
To illustrate our main results, we consider numerical experiments on quantile estimation and on several problems involving ReLU neural networks relevant in finance and artificial intelligence.
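For reference, a generic SGHMC update (a momentum variable with friction and injected noise, driven by noisy gradients) can be sketched as below; the quadratic potential and all constants are illustrative, and the paper's specific setting with discontinuous stochastic gradients (e.g. arising from ReLU networks) is not reproduced here.
```python
import numpy as np

# Generic stochastic gradient Hamiltonian Monte Carlo (SGHMC) sketch:
# sample from pi(theta) ∝ exp(-U(theta)) using noisy gradients of U.
rng = np.random.default_rng(1)

d = 5
def stoch_grad_U(theta):
    # unbiased but noisy gradient of U(theta) = 0.5 * ||theta||^2
    return theta + 0.1 * rng.normal(size=d)

eta, gamma = 1e-2, 1.0              # step size and friction
theta = rng.normal(size=d)
v = np.zeros(d)                     # momentum variable

samples = []
for k in range(50_000):
    v = (v - eta * gamma * v - eta * stoch_grad_U(theta)
         + np.sqrt(2 * gamma * eta) * rng.normal(size=d))
    theta = theta + eta * v
    if k > 10_000:                  # discard a burn-in phase
        samples.append(theta.copy())

print("sample covariance diagonal (target is the identity):",
      np.cov(np.array(samples).T).diagonal().round(2))
```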
arXiv Detail & Related papers (2024-09-25T17:21:09Z) - Proximal Interacting Particle Langevin Algorithms [0.0]
We introduce Proximal Interacting Particle Langevin Algorithms (PIPLA) for inference and learning in latent variable models.
We propose several variants within the novel proximal IPLA family, tailored to the problem of estimating parameters in a non-differentiable statistical model.
Our theory and experiments together show that the PIPLA family can be the de facto choice for parameter estimation problems in non-differentiable latent variable models.
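As a hedged illustration only: the sketch below combines an interacting-particle Langevin update for the latent variables with a proximal (soft-thresholding) step handling a non-smooth L1 term on the parameter. It is in the spirit of, but not identical to, the proximal IPLA variants proposed in the paper; the toy Gaussian latent variable model and all constants are assumptions.
```python
import numpy as np

# N Langevin particles for the latent variable z, a shared parameter theta driven
# by the particle-averaged gradient, and a proximal step for lam * |theta|.
rng = np.random.default_rng(2)

x_obs = 2.0                         # single observation (toy model)
lam = 0.5                           # weight of the non-smooth L1 penalty on theta
N, eta, n_iter = 200, 1e-2, 20_000

# Toy latent variable model: z ~ N(theta, 1), x | z ~ N(z, 1).
# Smooth potential U(theta, z) = 0.5*(z - theta)**2 + 0.5*(x_obs - z)**2.
theta = 0.0
z = rng.normal(size=N)

def soft_threshold(t, a):           # proximal map of a * |.|
    return np.sign(t) * np.maximum(np.abs(t) - a, 0.0)

for _ in range(n_iter):
    grad_theta = np.mean(theta - z)                  # averaged over particles
    grad_z = (z - theta) + (z - x_obs)
    theta = theta - eta * grad_theta + np.sqrt(2 * eta / N) * rng.normal()
    theta = soft_threshold(theta, eta * lam)         # proximal step for lam*|theta|
    z = z - eta * grad_z + np.sqrt(2 * eta) * rng.normal(size=N)

print("estimated theta:", round(theta, 3),
      "(the unpenalized MLE in this toy model is x_obs =", x_obs, ")")
```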
arXiv Detail & Related papers (2024-06-20T13:16:41Z) - Convergence analysis of controlled particle systems arising in deep learning: from finite to infinite sample size [1.4325734372991794]
We study the limiting behavior of the associated sampled optimal control problems as the sample size grows to infinity.
We show the convergence of the minima of objective functionals and optimal parameters of the neural SDEs as the sample size N tends to infinity.
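A rough sketch of the sampled objective in question: an Euler-Maruyama discretization of a drift-parameterized SDE, with a terminal cost averaged over N sampled paths. The drift architecture, cost, and constants are illustrative assumptions, not the paper's setup.
```python
import numpy as np

# Sampled objective for a "neural SDE": Euler-Maruyama discretization of
#   dX_t = f_theta(X_t) dt + sigma dW_t
# with a terminal cost averaged over N sampled initial conditions / noise paths.
rng = np.random.default_rng(7)

d, sigma, T, n_steps = 4, 0.1, 1.0, 50
dt = T / n_steps
W1 = rng.normal(scale=0.3, size=(d, d))      # "theta": parameters of the drift
W2 = rng.normal(scale=0.3, size=(d, d))

def drift(x):
    return np.tanh(x @ W1.T) @ W2.T

def sampled_objective(n_samples, seed=0):
    r = np.random.default_rng(seed)
    x = r.normal(size=(n_samples, d))        # N sampled initial conditions
    for _ in range(n_steps):
        x = x + drift(x) * dt + sigma * np.sqrt(dt) * r.normal(size=(n_samples, d))
    return np.mean(np.sum(x ** 2, axis=1))   # empirical terminal cost

# As N grows, the sampled objective approaches its infinite-sample counterpart.
for n in (10, 100, 1_000, 10_000):
    print(f"N={n:>6}: sampled objective = {sampled_objective(n):.4f}")
```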
arXiv Detail & Related papers (2024-04-08T04:22:55Z) - Convergence of mean-field Langevin dynamics: Time and space
discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
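A minimal finite-particle, time-discretized sketch of MFLD (two of the error sources the framework accounts for): noisy gradient descent on N particles whose drift depends on the empirical distribution, here only through the empirical mean. The toy functional and constants are assumptions.
```python
import numpy as np

# Finite-particle mean-field Langevin dynamics: each particle feels a confining
# gradient plus a term that depends on the empirical mean of all particles,
# with Gaussian noise scaled by the entropy-regularization strength lam.
rng = np.random.default_rng(3)

N, d = 1_000, 2
lam, alpha = 0.1, 2.0               # entropy regularization and interaction strength
m_star = np.array([1.0, -1.0])      # target for the mean (part of the toy functional)
eta, n_iter = 1e-2, 5_000

X = rng.normal(size=(N, d))         # particle positions

for _ in range(n_iter):
    m_emp = X.mean(axis=0)          # distribution-dependent quantity
    drift = X + alpha * (m_emp - m_star)     # gradient of the first variation at each particle
    X = X - eta * drift + np.sqrt(2 * lam * eta) * rng.normal(size=(N, d))

# For this toy functional, self-consistency gives a limiting mean of
# alpha / (1 + alpha) * m_star, here about [0.67, -0.67].
print("empirical mean of particles:", X.mean(axis=0).round(2))
```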
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Monte Carlo Neural PDE Solver for Learning PDEs via Probabilistic Representation [59.45669299295436]
We propose a Monte Carlo PDE solver for training unsupervised neural solvers.
We use the PDEs' probabilistic representation, which regards macroscopic phenomena as ensembles of random particles.
Our experiments on convection-diffusion, Allen-Cahn, and Navier-Stokes equations demonstrate significant improvements in accuracy and efficiency.
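The probabilistic representation in question can be illustrated on a 1D convection-diffusion equation, where the Feynman-Kac formula expresses the solution as an expectation over an ensemble of random particles. The initial condition and constants below are illustrative, and this is not the paper's neural solver.
```python
import numpy as np

# Feynman-Kac representation of the 1D convection-diffusion equation
#   u_t = kappa * u_xx - b * u_x,   u(x, 0) = u0(x):
#   u(x, t) = E[ u0(x - b*t + sqrt(2*kappa) * W_t) ].
# Monte Carlo estimate from an ensemble of random particles, compared with the
# closed form available for a Gaussian initial condition.
rng = np.random.default_rng(4)

kappa, b, t = 0.25, 1.0, 0.8
u0 = lambda x: np.exp(-x ** 2)

def u_monte_carlo(x, n_particles=200_000):
    w = rng.normal(scale=np.sqrt(t), size=n_particles)       # Brownian increments W_t
    return u0(x - b * t + np.sqrt(2 * kappa) * w).mean()

def u_exact(x):                     # Gaussian initial data convolved with the heat kernel
    return np.exp(-(x - b * t) ** 2 / (1 + 4 * kappa * t)) / np.sqrt(1 + 4 * kappa * t)

for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  MC={u_monte_carlo(x):.4f}  exact={u_exact(x):.4f}")
```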
arXiv Detail & Related papers (2023-02-10T08:05:19Z) - Learning Discretized Neural Networks under Ricci Flow [51.36292559262042]
We study Discretized Neural Networks (DNNs) composed of low-precision weights and activations.
DNNs suffer from either infinite or zero gradients due to the non-differentiable discrete function during training.
arXiv Detail & Related papers (2023-02-07T10:51:53Z) - Asymptotic Analysis of Deep Residual Networks [6.308539010172309]
We investigate the properties of deep Residual networks (ResNets) as the number of layers increases.
We first show the existence of scaling regimes for trained weights markedly different from those implicitly assumed in the neural ODE literature.
We study the hidden state dynamics in these scaling regimes, showing that one may obtain an ODE, a stochastic differential equation (SDE), or neither of these.
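A toy version of the scaling question: iterate the residual recursion h_{k+1} = h_k + L^{-beta} tanh(A_k h_k) with i.i.d. random layer weights and compare depth exponents beta. The paper's analysis concerns trained weights, so this only illustrates how the exponent controls the size of the accumulated updates.
```python
import numpy as np

# Residual recursion with i.i.d. zero-mean layer weights A_k and depth L.
# With beta = 1.0 the accumulated i.i.d. fluctuations shrink as L grows;
# with beta = 0.5 they stay of order one (random-walk scaling).
rng = np.random.default_rng(5)

d = 16
h0 = np.ones(d) / np.sqrt(d)

def final_state_norm(L, beta, seed=0):
    rng_l = np.random.default_rng(seed)
    h = h0.copy()
    for _ in range(L):
        A = rng_l.normal(scale=1.0 / np.sqrt(d), size=(d, d))
        h = h + L ** (-beta) * np.tanh(A @ h)
    return np.linalg.norm(h)

for beta in (1.0, 0.5):
    norms = [final_state_norm(L, beta) for L in (10, 100, 1000)]
    print(f"beta={beta}: ||h_L|| for L=10,100,1000 ->", np.round(norms, 3))
```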
arXiv Detail & Related papers (2022-12-15T23:55:01Z) - High-dimensional limit theorems for SGD: Effective dynamics and critical
scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD) in high dimensions.
We identify a critical scaling regime for the step size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
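As a toy illustration of tracking a summary statistic of high-dimensional online SGD, the sketch below runs linear regression with isotropic Gaussian data and a step size proportional to 1/d, recording R_k = ||theta_k - theta_star||^2 at rescaled times. The model, constants, and scaling are assumptions for illustration; the paper's results concern general summary statistics and their limiting dynamics.
```python
import numpy as np

# Online SGD for linear regression in dimension d with step size delta = c / d,
# tracking the summary statistic R_k = ||theta_k - theta_star||^2.
rng = np.random.default_rng(6)

d, c, sigma = 2_000, 0.5, 0.1
theta_star = rng.normal(size=d) / np.sqrt(d)
theta = np.zeros(d)
delta = c / d

R = []
for k in range(20 * d):             # O(d) steps so the rescaled time k * delta is O(1)
    x = rng.normal(size=d)          # one fresh sample per step (online SGD)
    y = x @ theta_star + sigma * rng.normal()
    theta -= delta * (x @ theta - y) * x
    if k % d == 0:                  # record every d steps, i.e. rescaled-time spacing c
        R.append(np.sum((theta - theta_star) ** 2))

print("R_k at rescaled times t = k*delta (spacing 0.5):", np.round(R, 4))
```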
arXiv Detail & Related papers (2022-06-08T17:42:18Z) - Multipole Graph Neural Operator for Parametric Partial Differential
Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data in a structure suitable for neural networks.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.