Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2407.02322v1
- Date: Tue, 2 Jul 2024 14:52:21 GMT
- Title: Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent
- Authors: Adrien Schertzer, Loucas Pillaud-Vivien
- Abstract summary: We study the dynamics of a continuous-time model of Stochastic Gradient Descent (SGD) for the least-squares problem.
We analyze degenerate Stochastic Differential Equations (SDEs) that model SGD either in the case of the training loss (finite samples) or the population one (online setting).
- Score: 6.3151583550712065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the dynamics of a continuous-time model of the Stochastic Gradient Descent (SGD) for the least-square problem. Indeed, pursuing the work of Li et al. (2019), we analyze Stochastic Differential Equations (SDEs) that model SGD either in the case of the training loss (finite samples) or the population one (online setting). A key qualitative feature of the dynamics is the existence of a perfect interpolator of the data, irrespective of the sample size. In both scenarios, we provide precise, non-asymptotic rates of convergence to the (possibly degenerate) stationary distribution. Additionally, we describe this asymptotic distribution, offering estimates of its mean, deviations from it, and a proof of the emergence of heavy-tails related to the step-size magnitude. Numerical simulations supporting our findings are also presented.
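To make the modelled dynamics concrete, here is a minimal numerical sketch (an illustration under assumed constants, not the paper's exact construction): single-sample SGD on a noiseless least-squares problem, so that a perfect interpolator exists as in the abstract, next to an Euler-Maruyama discretization of a generic first-order SDE model of SGD in the style of Li et al. (2019), with drift $-\nabla f$ and a diffusion term built from the per-sample gradient covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem with a perfect interpolator: y = X @ w_star exactly,
# so w_star interpolates the data (the key qualitative feature in the abstract).
n, d, eta, T = 50, 10, 5e-3, 20_000
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

def run_sgd(w0):
    """Single-sample SGD on the empirical loss f(w) = ||Xw - y||^2 / (2n)."""
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)
        w -= eta * (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i.w - y_i)^2
    return w

def run_sde(w0):
    """Euler-Maruyama for a generic first-order SDE model of SGD,
        dW = -grad f(W) dt + sqrt(eta) * Sigma(W)^{1/2} dB,
    where Sigma(w) is the covariance of the per-sample gradients and one SGD
    step is identified with time eta (so dt = eta). This is the Li et al.-style
    model; the paper's degenerate SDEs refine it, so treat this as a sketch."""
    w = w0.copy()
    for _ in range(T):
        r = X @ w - y
        grads = X * r[:, None]               # per-sample gradients, shape (n, d)
        C = np.cov(grads, rowvar=False) + 1e-12 * np.eye(d)
        w += -eta * grads.mean(axis=0) + eta * np.linalg.cholesky(C) @ rng.normal(size=d)
    return w

w0 = rng.normal(size=d)
print("SGD distance to interpolator:", np.linalg.norm(run_sgd(w0) - w_star))
print("SDE distance to interpolator:", np.linalg.norm(run_sde(w0) - w_star))
```

Note that the per-sample gradient covariance vanishes at the interpolator, so the diffusion term degenerates there; this is the mechanism behind the possibly degenerate stationary distribution described in the abstract.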
Related papers
- Convergence of Score-Based Discrete Diffusion Models: A Discrete-Time Analysis [56.442307356162864]
We study the theoretical aspects of score-based discrete diffusion models under the Continuous Time Markov Chain (CTMC) framework.
We introduce a discrete-time sampling algorithm in the general state space $[S]^d$ that utilizes score estimators at predefined time points.
Our convergence analysis employs a Girsanov-based method and establishes key properties of the discrete score function.
arXiv Detail & Related papers (2024-10-03T09:07:13Z)
- A Hessian-Aware Stochastic Differential Equation for Modelling SGD [28.974147174627102]
The Hessian-Aware Stochastic Modified Equation (HA-SME) is an approximating SDE that incorporates Hessian information of the objective function into both its drift and diffusion terms.
For quadratic objectives, HA-SME is proved to be the first SDE model that recovers the SGD dynamics exactly in the distributional sense (the sketch below illustrates the gap left by first-order SDE models).
arXiv Detail & Related papers (2024-05-28T17:11:34Z)
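The summary above does not spell out HA-SME's construction; as a hedged illustration of the gap it closes, the sketch below compares, for a one-dimensional quadratic with additive gradient noise, the exact stationary variance of SGD against the prediction of the standard first-order SDE model $dW = -aW\,dt + \sqrt{\eta}\,\sigma\,dB$. The $O(\eta)$ mismatch is the kind of discrepancy that Hessian-dependent drift and diffusion terms are designed to remove; all constants here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# SGD on f(w) = a*w^2/2 with additive gradient noise:
#   w <- (1 - eta*a) * w - eta*sigma*xi,   xi ~ N(0, 1).
# Solving V = (1 - eta*a)^2 V + eta^2 sigma^2 gives the exact stationary variance.
a, sigma, eta = 1.0, 1.0, 0.5
var_sgd = eta * sigma**2 / (a * (2.0 - eta * a))   # exact for SGD: 1/3 here
var_sde = eta * sigma**2 / (2.0 * a)               # OU prediction:  1/4 here

# Empirical check of the SGD formula.
w, acc = 0.0, []
for t in range(300_000):
    w = (1.0 - eta * a) * w - eta * sigma * rng.normal()
    if t >= 10_000:
        acc.append(w)
print("SGD empirical variance :", np.var(acc))
print("SGD exact variance     :", var_sgd)
print("first-order SDE model  :", var_sde)  # visibly off at this step size
```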
- On the Trajectory Regularity of ODE-based Diffusion Sampling [79.17334230868693]
Diffusion-based generative models use differential equations to establish a smooth connection between a complex data distribution and a tractable prior distribution.
In this paper, we identify several intriguing trajectory properties in the ODE-based sampling process of diffusion models.
arXiv Detail & Related papers (2024-05-18T15:59:41Z)
- Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models [10.781866671930857]
We analyze the dynamics of streaming gradient descent (SGD) in the high-dimensional limit.
We demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations.
In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient.
arXiv Detail & Related papers (2023-08-17T13:33:02Z)
- Exploring the Optimal Choice for Generative Processes in Diffusion Models: Ordinary vs Stochastic Differential Equations [6.2284442126065525]
We study the problem mathematically for two limiting scenarios: the zero diffusion (ODE) case and the large diffusion case.
Our findings indicate that when the perturbation occurs at the end of the generative process, the ODE model outperforms the SDE model with a large diffusion coefficient.
arXiv Detail & Related papers (2023-06-03T09:27:15Z)
- A Geometric Perspective on Diffusion Models [57.27857591493788]
We inspect the ODE-based sampling of a popular variance-exploding SDE.
We establish a theoretical relationship between the optimal ODE-based sampling and the classic mean-shift (mode-seeking) algorithm.
arXiv Detail & Related papers (2023-05-31T15:33:16Z)
- Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained by simply matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z)
- On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective [0.0]
We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD).
We exploit the continuous formulation of the SDE and the theory of Fokker-Planck equations to develop new results on the escaping phenomenon and its relationship with large batch sizes and sharp minima (a toy double-well simulation follows below).
arXiv Detail & Related papers (2021-12-02T05:24:05Z)
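As a toy illustration of the escape phenomenon (a sketch under assumed dynamics, not the paper's Fokker-Planck analysis): simulate $dX = -U'(X)\,dt + s\,dB$ in a double-well potential and watch the mean escape time explode as the noise level $s$ shrinks. Since the gradient-noise scale of SGD behaves roughly like $\sqrt{\eta/B}$, shrinking $s$ mimics growing the batch size $B$, which is why large batches tend to stay trapped near the minima they start in.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_U(x):
    """Gradient of the double-well potential U(x) = (x^2 - 1)^2,
    with minima at x = -1 and x = +1 and a barrier at x = 0."""
    return 4.0 * x * (x**2 - 1.0)

def mean_escape_steps(noise_std, n_runs=20, dt=1e-2, max_steps=200_000):
    """Euler-Maruyama for dX = -U'(X) dt + noise_std dB, started in the left
    well; count steps until X crosses the barrier (censored at max_steps)."""
    steps = []
    for _ in range(n_runs):
        x = -1.0
        for t in range(max_steps):
            x += -grad_U(x) * dt + noise_std * np.sqrt(dt) * rng.normal()
            if x > 0.0:
                break
        steps.append(t)
    return float(np.mean(steps))

for s in (1.2, 0.9, 0.6):   # decreasing noise mimics increasing batch size
    print(f"noise {s}: mean escape steps ~ {mean_escape_steps(s):.0f}")
```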
- Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped stochastic gradient descent algorithm and provide an improved analysis under a more nuanced condition on the gradient noise (a minimal sketch follows below).
arXiv Detail & Related papers (2021-08-25T21:30:27Z)
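A hedged sketch of the flavour of estimator referred to above: streaming mean estimation by clipped SGD on the quadratic loss $\tfrac12(\theta - x)^2$, compared with the plain sample mean on symmetric heavy-tailed data. The clipping radius, step sizes, and data model below are illustrative assumptions, not the paper's algorithm or guarantees.

```python
import numpy as np

rng = np.random.default_rng(2)

def clipped_streaming_mean(stream, clip=5.0):
    """Streaming clipped SGD on 0.5 * (theta - x)^2: the per-sample gradient
    (theta - x) is clipped so one heavy-tailed sample cannot move the iterate
    arbitrarily far; with step size 1/t the unclipped version is the sample mean."""
    theta = 0.0
    for t, x in enumerate(stream, start=1):
        g = float(np.clip(theta - x, -clip, clip))
        theta -= g / t
    return theta

true_mean, T, reps = 3.0, 2_000, 200
err_clip, err_mean = [], []
for _ in range(reps):
    # Student-t with 1.5 degrees of freedom: symmetric, infinite variance.
    xs = true_mean + rng.standard_t(1.5, size=T)
    err_clip.append(abs(clipped_streaming_mean(xs) - true_mean))
    err_mean.append(abs(xs.mean() - true_mean))
print("95th-percentile error, clipped SGD :", np.quantile(err_clip, 0.95))
print("95th-percentile error, sample mean :", np.quantile(err_mean, 0.95))
```

By symmetry of the noise, clipping introduces no bias here, while bounding each increment replaces the heavy-tailed fluctuations of the running mean with much tighter concentration.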
- Accurate Characterization of Non-Uniformly Sampled Time Series using Stochastic Differential Equations [0.0]
Non-uniform sampling arises when an experimenter does not have full control over the sampling characteristics of the process under investigation.
We introduce new initial estimates for the numerical optimization of the likelihood.
We show the increased accuracy achieved by the new estimator in simulation experiments.
arXiv Detail & Related papers (2020-07-02T13:03:09Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, yet the precise role of the stochasticity in its success remains unclear.
We show that multiplicative noise, which commonly arises from variance in the local rates of convergence, produces heavy tails in the stationary distribution of the parameters.
We describe the dependence on key factors, including step size, batch size, and data variability, and find behaviour consistent with recent empirical results on state-of-the-art neural network models (a toy reproduction follows below).
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
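The mechanism in the entry above can be reproduced in a few lines (a sketch with assumed constants, not the paper's model): SGD on a one-dimensional quadratic whose curvature fluctuates from step to step is a random linear recurrence, and Kesten-type multiplicative noise makes the stationary distribution of the iterates heavy-tailed even though all driving noise is Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)

# SGD step on f(w) = h_t * w^2 / 2 with per-step random curvature h_t:
#   w <- (1 - eta * h_t) * w - eta * xi_t.
# The factor (1 - eta*h_t) is multiplicative noise; because it occasionally
# exceeds 1 in magnitude, the stationary law develops a power-law tail.
eta, T, burn = 0.9, 500_000, 10_000
h = rng.exponential(1.0, size=T)      # random local curvature (rate variability)
xi = rng.normal(size=T)               # light-tailed additive gradient noise
w, traj = 0.0, np.empty(T - burn)
for t in range(T):
    w = (1.0 - eta * h[t]) * w - eta * xi[t]
    if t >= burn:
        traj[t - burn] = w

x = np.sort(np.abs(traj))
k = 2_000                              # upper order statistics for the Hill estimator
alpha_hat = 1.0 / np.mean(np.log(x[-k:] / x[-k - 1]))
print("Hill tail-index estimate alpha ~", alpha_hat)     # finite alpha => power law
print("excess kurtosis:", ((traj - traj.mean())**4).mean() / traj.var()**2 - 3.0)
```

A Gaussian stationary law would give an excess kurtosis near zero; here the kurtosis blows up and the Hill estimate settles at a small finite tail index, the signature of heavy tails.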