On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective
- URL: http://arxiv.org/abs/2112.00987v1
- Date: Thu, 2 Dec 2021 05:24:05 GMT
- Title: On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective
- Authors: Xiaowu Dai and Yuhua Zhu
- Abstract summary: We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD).
We exploit the continuous formulation of SDEs and the theory of Fokker-Planck equations to develop new results on the escaping phenomenon and its relationship with large batches and sharp minima.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the statistical properties of the dynamic trajectory of stochastic
gradient descent (SGD). We approximate the mini-batch SGD and the momentum SGD
as stochastic differential equations (SDEs). We exploit the continuous
formulation of SDE and the theory of Fokker-Planck equations to develop new
results on the escaping phenomenon and the relationship with large batch and
sharp minima. In particular, we find that the stochastic process solution tends
to converge to flatter minima regardless of the batch size in the asymptotic
regime. However, the convergence rate is rigorously proven to depend on the
batch size. These results are validated empirically with various datasets and
models.
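For orientation, a minimal sketch of the SDE and Fokker-Planck formulation this line of work builds on, under a simplifying isotropic gradient-noise assumption (covariance $\sigma^2 I$); the paper's exact assumptions and constants may differ. Mini-batch SGD with learning rate $\eta$ and batch size $B$ is approximated by the SDE
$$ d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\tfrac{\eta\sigma^2}{B}}\,dW_t , $$
and the density $\rho(\theta,t)$ of $\theta_t$ then evolves by the Fokker-Planck equation
$$ \partial_t \rho = \nabla\cdot\big(\rho\,\nabla L\big) + \frac{\eta\sigma^2}{2B}\,\Delta\rho , $$
whose stationary solution is the Gibbs density $\rho_\infty(\theta) \propto \exp\!\big(-\tfrac{2B}{\eta\sigma^2}\,L(\theta)\big)$. In this simplified picture the ratio $\eta/B$ acts as an effective temperature, so the batch size enters the continuous dynamics only through the noise scale; the paper's statements on flat-minima preference and batch-size-dependent convergence rates follow from a more careful analysis of this type of Fokker-Planck equation.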
Related papers
- Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent [6.3151583550712065]
We study the dynamics of a continuous-time model of stochastic gradient descent (SGD).
We analyze degenerate stochastic differential equations (SDEs) that model SGD either for the training loss (finite samples) or the population loss (online setting).
arXiv Detail & Related papers (2024-07-02T14:52:21Z) - A Hessian-Aware Stochastic Differential Equation for Modelling SGD [28.974147174627102]
Hessian-Aware Modified Equation (HA-SME) is an approximation SDE that incorporates Hessian information of the objective function into both its drift and diffusion terms.
For quadratic objectives, HA-SME is proved to be the first SDE model that recovers exactly the SGD dynamics in the distributional sense.
arXiv Detail & Related papers (2024-05-28T17:11:34Z) - Gaussian Mixture Solvers for Diffusion Models [84.83349474361204]
We introduce Gaussian Mixture Solvers (GMS), a novel class of SDE-based solvers for diffusion models.
Our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis.
arXiv Detail & Related papers (2023-11-02T02:05:38Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - A Geometric Perspective on Diffusion Models [57.27857591493788]
We inspect the ODE-based sampling of a popular variance-exploding SDE.
We establish a theoretical relationship between the optimal ODE-based sampling and the classic mean-shift (mode-seeking) algorithm.
arXiv Detail & Related papers (2023-05-31T15:33:16Z) - Continuous-time stochastic gradient descent for optimizing over the
stationary distribution of stochastic differential equations [7.65995376636176]
We develop a new continuous-time stochastic gradient descent method for optimizing over the stationary distribution of stochastic differential equation (SDE) models.
We rigorously prove convergence of the online forward propagation algorithm for linear SDE models and present its numerical results for nonlinear examples.
arXiv Detail & Related papers (2022-02-14T11:45:22Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped stochastic gradient descent algorithm and provide an improved analysis under a more nuanced condition on the noise of the stochastic gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Learning effective stochastic differential equations from microscopic
simulations: combining stochastic numerics and deep learning [0.46180371154032895]
We approximate the drift and diffusivity functions of an effective SDE through neural networks.
Our approach does not require long trajectories, works on scattered snapshot data, and is designed to naturally handle different time steps per snapshot.
arXiv Detail & Related papers (2021-06-10T13:00:18Z) - Amortized variance reduction for doubly stochastic objectives [17.064916635597417]
Approximate inference in complex probabilistic models requires optimisation of doubly stochastic objective functions.
Current approaches do not take into account how minibatch stochasticity affects sampling stochasticity, resulting in sub-optimal variance reduction.
We propose a new approach in which we use a recognition network to cheaply approximate the optimal control variate for each mini-batch, with no additional gradient computations.
arXiv Detail & Related papers (2020-03-09T13:23:14Z)