From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent
- URL: http://arxiv.org/abs/2210.06705v1
- Date: Thu, 13 Oct 2022 03:55:04 GMT
- Title: From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent
- Authors: Satyen Kale, Jason D. Lee, Chris De Sa, Ayush Sekhari, Karthik
Sridharan
- Abstract summary: Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
An overarching theme of the paper is providing general conditions under which SGD converges, assuming that GF on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including Phase Retrieval and Matrix square root.
- Score: 50.4531316289086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Stochastic Gradient Descent (SGD) has been the method of choice for learning
large-scale non-convex models. While a general analysis of when SGD works has
been elusive, there has been a lot of recent progress in understanding the
convergence of Gradient Flow (GF) on the population loss, partly due to the
simplicity that a continuous-time analysis buys us. An overarching theme of our
paper is providing general conditions under which SGD converges, assuming that
GF on the population loss converges. Our main tool to establish this connection
is a general converse Lyapunov like theorem, which implies the existence of a
Lyapunov potential under mild assumptions on the rates of convergence of GF. In
fact, using these potentials, we show a one-to-one correspondence between rates
of convergence of GF and geometrical properties of the underlying objective.
When these potentials further satisfy certain self-bounding properties, we show
that they can be used to provide a convergence guarantee for Gradient Descent
(GD) and SGD (even when the paths of GF and GD/SGD are quite far apart). It
turns out that these self-bounding assumptions are in a sense also necessary
for GD/SGD to work. Using our framework, we provide a unified analysis for
GD/SGD not only for classical settings like convex losses, or objectives that
satisfy PL / KL properties, but also for more complex problems including Phase
Retrieval and Matrix sq-root, and extending the results in the recent work of
Chatterjee 2022.
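The connection the abstract describes, from convergence of gradient flow on the population loss to convergence of discrete-time SGD, can be illustrated numerically. Below is a minimal sketch, not the paper's construction: it compares an Euler discretization of gradient flow on the population loss against single-sample SGD on a toy least-squares objective, which satisfies a PL-type inequality. All variable names, dimensions, and step sizes are illustrative assumptions.

```python
# Minimal sketch (not the paper's construction): compare Gradient Flow on the
# population loss, approximated by small-step Euler, against single-sample SGD
# on a toy least-squares objective satisfying a Polyak-Lojasiewicz inequality.
# Dimensions, step sizes, and the objective itself are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1000
A = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = A @ w_star                      # noiseless targets, so the minimum is attained

def population_loss(w):
    return 0.5 * np.mean((A @ w - y) ** 2)

def population_grad(w):
    return A.T @ (A @ w - y) / n

def stochastic_grad(w, batch=1):
    idx = rng.integers(0, n, size=batch)
    return A[idx].T @ (A[idx] @ w - y[idx]) / batch

w_gf = np.zeros(d)                  # Euler discretization of gradient flow
w_sgd = np.zeros(d)                 # single-sample SGD
eta, steps = 1e-3, 20000

for t in range(steps):
    w_gf -= eta * population_grad(w_gf)
    w_sgd -= eta * stochastic_grad(w_sgd)

print(f"GF (Euler) population loss: {population_loss(w_gf):.3e}")
print(f"SGD population loss       : {population_loss(w_sgd):.3e}")
```

Even though the SGD path deviates from the gradient-flow path step by step, both drive the population loss toward its minimum in this toy setting, which is the kind of behavior the paper's self-bounding Lyapunov potentials are used to guarantee in general.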
Related papers
- On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community.
Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z) - Demystifying the Myths and Legends of Nonconvex Convergence of SGD [17.445810977264067]
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems.
Based on our analyses, we address certain myths and legends related to the nonconvex convergence of SGD.
arXiv Detail & Related papers (2023-10-19T17:58:59Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
Stochastic gradient descent (SGD) has been intensively developed and extensively applied in machine learning over the past decade.
Modified SGD-type algorithms such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad) outperform vanilla SGD in many applications in terms of convergence rate and accuracy.
We focus on the convergence analysis of mSGD and AdaGrad for smooth (possibly non-convex) loss functions in machine learning.
arXiv Detail & Related papers (2022-01-26T22:02:21Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
This work provides a general sample complexity analysis of the "vanilla" PG method.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth
Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO.
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z) - Convergence of stochastic gradient descent schemes for
Lojasiewicz-landscapes [0.0]
We consider the convergence of stochastic gradient descent schemes under weak assumptions on the underlying landscape.
In particular, we show that for neural networks with analytic activation function such as softplus, sigmoid and the hyperbolic tangent, SGD converges on the event of staying bounded.
arXiv Detail & Related papers (2021-02-16T12:42:25Z)
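To make the last entry's claim concrete, here is a minimal sketch (not that paper's analysis): single-sample SGD on a tiny one-hidden-layer softplus network with synthetic data, monitoring the training loss and the iterate norm in the spirit of "converges on the event of staying bounded". The network size, data, and step size are illustrative assumptions.

```python
# Illustrative sketch only: SGD on a toy one-hidden-layer softplus network.
# It reports the final training MSE and iterate norm; all sizes and the
# step size below are assumptions, not values from the cited paper.
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 200, 5, 16
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))          # smooth synthetic regression targets

W1 = rng.normal(size=(d, h)) * 0.5
w2 = rng.normal(size=h) * 0.5

def softplus(z):
    return np.logaddexp(0.0, z)             # analytic activation, as discussed above

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # derivative of softplus

for t in range(20000):
    i = rng.integers(n)                      # single-sample SGD step
    z = X[i] @ W1
    err = softplus(z) @ w2 - y[i]
    grad_w2 = err * softplus(z)
    grad_W1 = err * np.outer(X[i], sigmoid(z) * w2)
    w2 -= 1e-2 * grad_w2
    W1 -= 1e-2 * grad_W1

preds = softplus(X @ W1) @ w2
print("train MSE    :", np.mean((preds - y) ** 2))
print("iterate norm :", np.sqrt(np.sum(W1**2) + np.sum(w2**2)))
```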