What is the long-run distribution of stochastic gradient descent? A large deviations analysis
- URL: http://arxiv.org/abs/2406.09241v1
- Date: Thu, 13 Jun 2024 15:44:23 GMT
- Title: What is the long-run distribution of stochastic gradient descent? A large deviations analysis
- Authors: Waïss Azizian, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos
- Abstract summary: We show that, in the long run, the problem's critical region is visited exponentially more often than any non-critical region.
All other connected components of critical points are visited with frequency that is exponentially proportional to their energy level.
- Score: 29.642830843568525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.
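The Boltzmann-Gibbs picture in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the tilted double-well objective, the constant Gaussian gradient noise, and all names (`f`, `grad_f`, `sgd_fraction_left`, `gibbs_fraction_left`) are assumptions chosen for illustration. Under these assumptions, the standard diffusion approximation of SGD suggests a stationary density proportional to exp(-2 f(x) / (step * sigma^2)), so the step-size plays the role of temperature. The paper's energy levels are defined more generally and need not reduce to the objective.

```python
import math
import random


def f(x):
    # Hypothetical tilted double-well objective: deeper well near x = -1.09,
    # shallower well near x = 0.88.
    return (x * x - 1.0) ** 2 / 4.0 + 0.2 * x


def grad_f(x):
    # Exact derivative of f.
    return x * (x * x - 1.0) + 0.2


def sgd_fraction_left(step=0.05, sigma=2.0, n_steps=400_000, seed=0):
    """Run constant-step SGD with additive Gaussian gradient noise and
    return the fraction of iterates that fall in the region x < 0."""
    rng = random.Random(seed)
    x = 1.0  # start near the shallower well
    left = 0
    for _ in range(n_steps):
        noisy_grad = grad_f(x) + sigma * rng.gauss(0.0, 1.0)
        x -= step * noisy_grad
        if x < 0.0:
            left += 1
    return left / n_steps


def gibbs_fraction_left(step=0.05, sigma=2.0, lo=-3.0, hi=3.0, n_grid=6000):
    """Mass that the density proportional to exp(-f(x) / T) assigns to x < 0,
    with T = step * sigma**2 / 2 (assumed diffusion approximation),
    computed with a midpoint rule on [lo, hi]."""
    temperature = step * sigma ** 2 / 2.0
    h = (hi - lo) / n_grid
    left = 0.0
    total = 0.0
    for i in range(n_grid):
        x = lo + (i + 0.5) * h
        w = math.exp(-f(x) / temperature)
        total += w
        if x < 0.0:
            left += w
    return left / total


if __name__ == "__main__":
    print("empirical fraction with x < 0:", sgd_fraction_left())
    print("Gibbs prediction for x < 0:   ", gibbs_fraction_left())
```

In this toy setting the iterates should spend most of their time in the deeper well, and the empirical fraction should be close to the Gibbs prediction. Reducing `step` lowers the effective temperature and sharpens the concentration, at the cost of much longer transition times between wells.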
Related papers
- Thermalization in Trapped Bosonic Systems With Disorder [3.1457219084519004]
We study experimentally accessible states in a system of bosonic atoms trapped in an open linear chain with disorder.
We find that, within certain tolerances, most states in the chaotic region thermalize.
However, states with low participation ratios in the energy eigenstate basis show greater deviations from thermal equilibrium values.
arXiv Detail & Related papers (2024-07-05T19:00:02Z) - Conditional Independence of 1D Gibbs States with Applications to Efficient Learning [0.23301643766310368]
We show that spin chains in thermal equilibrium have a correlation structure in which individual regions are strongly correlated at most with their near vicinity.
We prove that these measures decay superexponentially at every positive temperature.
arXiv Detail & Related papers (2024-02-28T17:28:01Z) - Universality in the tripartite information after global quenches: spin flip and semilocal charges [0.0]
We study stationary states emerging after global quenches in which the time evolution is under local Hamiltonians.
We show that a localized perturbation in the initial state can turn an exponential decay of spatial correlations in the stationary state into an algebraic decay.
arXiv Detail & Related papers (2023-07-04T17:44:56Z) - Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z) - Localization in the random XXZ quantum spin chain [55.2480439325792]
We study the many-body localization (MBL) properties of the Heisenberg XXZ spin-$\frac{1}{2}$ chain in a random magnetic field.
We prove that the system exhibits localization in any given energy interval at the bottom of the spectrum in a nontrivial region of the parameter space.
arXiv Detail & Related papers (2022-10-26T17:25:13Z) - From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
An overarching theme of this paper is providing general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z) - Role of boundary conditions in the full counting statistics of topological defects after crossing a continuous phase transition [62.997667081978825]
We analyze the role of boundary conditions in the statistics of topological defects.
We show that for fast and moderate quenches, the cumulants of the kink number distribution present a universal scaling with the quench rate.
arXiv Detail & Related papers (2022-07-08T09:55:05Z) - Emergence of Fermi's Golden Rule [55.73970798291771]
Fermi's Golden Rule (FGR) applies in the limit where an initial quantum state is weakly coupled to a continuum of other final states overlapping its energy.
Here we investigate what happens away from this limit, where the set of final states is discrete, with a nonzero mean level spacing.
arXiv Detail & Related papers (2022-06-01T18:35:21Z) - Localization properties of the asymptotic density distribution of a one-dimensional disordered system [0.0]
Anderson localization is the ubiquitous phenomenon of inhibition of transport of classical and quantum waves in a disordered medium.
The exact shape of the stationary localized distribution differs from a purely exponential profile and was computed almost fifty years ago by Gogolin.
Using the atomic quantum kicked rotor, a paradigmatic quantum simulator of Anderson localization physics, we study this distribution.
arXiv Detail & Related papers (2022-03-16T09:40:39Z) - Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that stochastic gradient noise (SGN) is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
arXiv Detail & Related papers (2021-05-05T13:54:26Z) - Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods to train deep neural networks.
We show that the covariance of the noise of SGD in the local region of the local minima is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.