From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent
- URL: http://arxiv.org/abs/2210.06705v1
- Date: Thu, 13 Oct 2022 03:55:04 GMT
- Title: From Gradient Flow on Population Loss to Learning with Stochastic
Gradient Descent
- Authors: Satyen Kale, Jason D. Lee, Chris De Sa, Ayush Sekhari, Karthik
Sridharan
- Abstract summary: Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
An overarching theme of the paper is providing general conditions under which SGD converges, assuming that Gradient Flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including Phase Retrieval and Matrix square root.
- Score: 50.4531316289086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Stochastic Gradient Descent (SGD) has been the method of choice for learning
large-scale non-convex models. While a general analysis of when SGD works has
been elusive, there has been a lot of recent progress in understanding the
convergence of Gradient Flow (GF) on the population loss, partly due to the
simplicity that a continuous-time analysis buys us. An overarching theme of our
paper is providing general conditions under which SGD converges, assuming that
GF on the population loss converges. Our main tool to establish this connection
is a general converse Lyapunov-like theorem, which implies the existence of a
Lyapunov potential under mild assumptions on the rates of convergence of GF. In
fact, using these potentials, we show a one-to-one correspondence between rates
of convergence of GF and geometrical properties of the underlying objective.
When these potentials further satisfy certain self-bounding properties, we show
that they can be used to provide a convergence guarantee for Gradient Descent
(GD) and SGD (even when the paths of GF and GD/SGD are quite far apart). It
turns out that these self-bounding assumptions are in a sense also necessary
for GD/SGD to work. Using our framework, we provide a unified analysis for
GD/SGD not only for classical settings like convex losses, or objectives that
satisfy PL / KL properties, but also for more complex problems including Phase
Retrieval and Matrix square root, extending the results in the recent work of
Chatterjee 2022.
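To make the framework concrete on one of the paper's own examples, the following Python sketch runs SGD on a Phase Retrieval objective and tracks a Lyapunov-style potential along the iterates. It is purely illustrative: the Gaussian measurement model, constant step size, near-target initialization, and the squared-distance potential are assumptions made for this demo, not the potentials constructed in the paper.

```python
import numpy as np

# Minimal numerical sketch of the GF-to-SGD story on Phase Retrieval (one of the
# paper's examples).  The measurement model, step size, initialization, and the
# potential below are illustrative assumptions, not the paper's construction.

rng = np.random.default_rng(0)
d, n = 20, 5000
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)          # ground-truth signal, unit norm
A = rng.normal(size=(n, d))               # measurement vectors a_i
y = (A @ w_star) ** 2                     # phaseless observations y_i = <a_i, w*>^2

def sgd_grad(w, i):
    """Stochastic gradient of the per-sample loss 0.25 * (<a_i, w>^2 - y_i)^2."""
    a = A[i]
    return ((a @ w) ** 2 - y[i]) * (a @ w) * a

def potential(w):
    """Lyapunov-style progress measure: squared distance to the closer of +/- w*."""
    return min(np.sum((w - w_star) ** 2), np.sum((w + w_star) ** 2))

w = w_star + 0.1 * rng.normal(size=d)     # local initialization near the target
eta = 0.005                               # constant step size (an assumption)
for t in range(20001):
    if t % 5000 == 0:
        print(f"step {t:6d}  potential {potential(w):.6f}")
    i = rng.integers(n)
    w = w - eta * sgd_grad(w, i)
```

Under these assumptions the printed potential should shrink over the run, loosely mirroring how the paper's self-bounding potentials convert convergence of GF on the population loss into a guarantee for SGD.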
Related papers
- On the Convergence of DP-SGD with Adaptive Clipping [56.24689348875711]
Stochastic Gradient Descent (SGD) with gradient clipping is a powerful technique for enabling differentially private optimization.
This paper provides the first comprehensive convergence analysis of SGD with quantile clipping (QC-SGD).
We show that QC-SGD suffers from a bias problem similar to constant-threshold clipped SGD, but that this bias can be mitigated through a carefully designed quantile and step size schedule (a minimal illustrative sketch of a quantile-clipped update is given after the related papers below).
arXiv Detail & Related papers (2024-12-27T20:29:47Z) - On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community.
Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z) - Demystifying the Myths and Legends of Nonconvex Convergence of SGD [17.445810977264067]
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems.
Through our analysis, we address certain myths and legends related to the nonconvex convergence of SGD.
arXiv Detail & Related papers (2023-10-19T17:58:59Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - Global Convergence of SGD On Two Layer Neural Nets [0.2302001830524133]
We consider the appropriately regularized $\ell_2$-empirical risk of depth $2$ nets with any number of gates.
We show bounds on how the empirical loss evolves as SGD iterates on it, for arbitrary data, provided the activation is adequately smooth and bounded, like sigmoid and tanh.
arXiv Detail & Related papers (2022-10-20T17:50:46Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - On the Convergence of mSGD and AdaGrad for Stochastic Optimization [0.696125353550498]
Stochastic gradient descent (SGD) has been intensively developed and extensively applied in machine learning over the past decade.
Some modified SGD-type algorithms, such as momentum-based SGD (mSGD) and adaptive gradient optimization (AdaGrad), outperform vanilla SGD in many applications in terms of convergence rate and accuracy.
We focus on the convergence analysis of mSGD and AdaGrad for any smooth (possibly non-convex) loss functions in machine learning.
arXiv Detail & Related papers (2022-01-26T22:02:21Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
Despite its popularity, a solid theoretical understanding of even the "vanilla" PG has remained elusive.
arXiv Detail & Related papers (2021-07-23T19:38:17Z)
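To make the quantile-clipping idea from the DP-SGD entry above concrete, here is a minimal, hypothetical sketch of a single quantile-clipped SGD step. The quantile level, step size, noise scaling, and the synthetic gradients in the usage example are placeholder assumptions, not the quantile and step size schedule analyzed in that paper.

```python
import numpy as np

def qc_sgd_step(w, grads, q=0.9, eta=0.1, noise_mult=1.0, rng=None):
    """One quantile-clipped SGD step (illustrative sketch of the QC-SGD idea).

    `grads` holds per-example gradients with shape (batch, dim).  The clipping
    threshold is the q-th quantile of the per-example gradient norms; Gaussian
    noise proportional to that threshold is added, as in DP-SGD-style updates.
    All constants here are placeholders, not the schedule analyzed in the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    norms = np.linalg.norm(grads, axis=1)
    c = np.quantile(norms, q)                             # adaptive clipping threshold
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    clipped = grads * scale[:, None]                      # clip each per-example gradient
    noise = noise_mult * c * rng.normal(size=w.shape) / len(grads)
    return w - eta * (clipped.mean(axis=0) + noise)

# Synthetic usage: random stand-in gradients, just to show the call pattern.
rng = np.random.default_rng(1)
w = np.zeros(10)
for _ in range(5):
    fake_grads = rng.normal(size=(64, 10))
    w = qc_sgd_step(w, fake_grads, rng=rng)
print(np.linalg.norm(w))
```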