A Precise Characterization of SGD Stability Using Loss Surface Geometry
- URL: http://arxiv.org/abs/2401.12332v1
- Date: Mon, 22 Jan 2024 19:46:30 GMT
- Title: A Precise Characterization of SGD Stability Using Loss Surface Geometry
- Authors: Gregory Dexter, Borja Ocejo, Sathiya Keerthi, Aman Gupta, Ayan
Acharya, Rajiv Khanna
- Abstract summary: Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding.
Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates.
- Score: 8.942671556572073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic Gradient Descent (SGD) stands as a cornerstone optimization
algorithm with proven real-world empirical successes but relatively limited
theoretical understanding. Recent research has illuminated a key factor
contributing to its practical efficacy: the implicit regularization it
instigates. Several studies have investigated the linear stability property of
SGD in the vicinity of a stationary point as a predictive proxy for sharpness
and generalization error in overparameterized neural networks (Wu et al., 2022;
Jastrzebski et al., 2019; Cohen et al., 2021). In this paper, we delve deeper
into the relationship between linear stability and sharpness. More
specifically, we meticulously delineate the necessary and sufficient conditions
for linear stability, contingent on hyperparameters of SGD and the sharpness at
the optimum. Towards this end, we introduce a novel coherence measure of the
loss Hessian that encapsulates pertinent geometric properties of the loss
function that are relevant to the linear stability of SGD. It enables us to
provide a simplified sufficient condition for identifying linear instability at
an optimum. Notably, compared to previous works, our analysis relies on
significantly milder assumptions and is applicable for a broader class of loss
functions than known before, encompassing not only mean-squared error but also
cross-entropy loss.
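To make the notion of linear stability concrete, here is a minimal numerical sketch (not the paper's construction): it iterates the linearized SGD update $\delta_{t+1} = (I - \eta H_{B_t})\,\delta_t$ around a toy interpolating minimum with per-sample quadratic losses and checks whether the mean-squared distance to the optimum stays bounded. The per-sample Hessians, step size, and batch size are illustrative assumptions; the paper's coherence measure of the loss Hessian is not reproduced here.
```python
# Hedged numerical sketch: probe mean-square linear stability of SGD near an
# interpolating minimum by iterating delta_{t+1} = (I - eta * H_B) delta_t,
# where H_B averages the Hessians of a random mini-batch of per-sample
# quadratic losses. All problem sizes and hyperparameters are assumed values.
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 10          # number of samples, parameter dimension (assumed)
eta, batch = 0.1, 8    # step size and mini-batch size (assumed)

# Per-sample PSD Hessians H_i = x_i x_i^T at the optimum (interpolation case).
X = rng.normal(size=(n, d))
H_samples = np.einsum("ni,nj->nij", X, X)
H_full = H_samples.mean(axis=0)
sharpness = np.linalg.eigvalsh(H_full).max()
print(f"sharpness = {sharpness:.3f}, full-batch GD threshold 2/eta = {2/eta:.3f}")

def mean_square_growth(num_steps=200, num_trials=200):
    """Average ||delta_T||^2 after num_steps of linearized SGD, over trials."""
    total = 0.0
    for _ in range(num_trials):
        delta = rng.normal(size=d)
        delta /= np.linalg.norm(delta)
        for _ in range(num_steps):
            idx = rng.choice(n, size=batch, replace=False)
            H_B = H_samples[idx].mean(axis=0)
            delta = delta - eta * (H_B @ delta)
        total += delta @ delta
    return total / num_trials

print(f"E||delta_T||^2 after 200 steps ~ {mean_square_growth():.3e}")
# Bounded (or shrinking) values suggest linear stability at this (eta, batch);
# exploding values indicate instability in the mean-square sense.
```
In this toy quadratic setting, full-batch gradient descent is linearly stable exactly when the sharpness $\lambda_{\max}(H)$ is at most $2/\eta$, while mini-batch SGD can lose stability earlier; quantifying that gap via the geometry of the loss Hessian is what the paper's coherence measure is aimed at.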
Related papers
- Exact Mean Square Linear Stability Analysis for SGD [28.65663421598186]
We provide an explicit condition on the step size that is both necessary and sufficient for the linear stability of stochastic gradient descent (SGD).
We show that SGD's stability threshold is equivalent to that of a mixture process which, at each iteration, takes a full-batch gradient step w.p. $1-p$ and a single-sample gradient step w.p. $p$ (a toy simulation of this mixture process is sketched after the related-papers list below).
arXiv Detail & Related papers (2023-06-13T15:29:23Z) - The Implicit Regularization of Dynamical Stability in Stochastic
Gradient Descent [32.25490196411385]
We study the implicit regularization of stochastic gradient descent (SGD) through the lens of dynamical stability.
We analyze the generalization properties of two-layer ReLU networks and diagonal linear networks.
arXiv Detail & Related papers (2023-05-27T14:54:21Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
Even so, theoretical understanding of "vanilla" PG has remained limited.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth
Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO.
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z) - Stability of SGD: Tightness Analysis and Improved Bounds [8.831597193643628]
Stochastic Gradient Descent (SGD) based methods have been widely used for training large-scale machine learning models that generalize well in practice.
This paper addresses the question: is the stability analysis of [18] tight for smooth functions, and if not, for what kind of loss functions and data can the analysis be improved?
arXiv Detail & Related papers (2021-02-10T05:43:27Z) - On the Stability Properties and the Optimization Landscape of Training
Problems with Squared Loss for Neural Networks and General Nonlinear Conic
Approximation Schemes [0.0]
We study the optimization landscape and the stability properties of training problems with squared loss for neural networks and general nonlinear conic approximation schemes.
We prove that the same effects that are responsible for these instability properties are also the reason for the emergence of saddle points and spurious local minima.
arXiv Detail & Related papers (2020-11-06T11:34:59Z) - Fine-Grained Analysis of Stability and Generalization for Stochastic
Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first-ever-known stability and generalization bounds for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)
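As referenced in the "Exact Mean Square Linear Stability Analysis for SGD" entry above, here is a hedged sketch of the full-batch/single-sample mixture process. The toy quadratic problem and the mixing probability p are illustrative assumptions; the cited paper ties p to the batch size, which is not reproduced here.
```python
# Hedged sketch of the mixture process: at each step take a full-batch
# gradient step with probability 1 - p and a single-sample gradient step with
# probability p, on a toy quadratic problem (all values below are assumed).
import numpy as np

rng = np.random.default_rng(1)
n, d, eta, p = 64, 10, 0.1, 0.2

X = rng.normal(size=(n, d))
H_samples = np.einsum("ni,nj->nij", X, X)   # per-sample Hessians x_i x_i^T
H_full = H_samples.mean(axis=0)

def mixture_growth(num_steps=200, num_trials=500):
    """Average ||delta_T||^2 under the full-batch / single-sample mixture."""
    total = 0.0
    for _ in range(num_trials):
        delta = rng.normal(size=d)
        delta /= np.linalg.norm(delta)
        for _ in range(num_steps):
            if rng.random() < p:
                H = H_samples[rng.integers(n)]   # single-sample step
            else:
                H = H_full                        # full-batch step
            delta = delta - eta * (H @ delta)
        total += delta @ delta
    return total / num_trials

print(f"mixture process: E||delta_T||^2 ~ {mixture_growth():.3e}")
# Per the cited paper, whether this quantity stays bounded as the number of
# steps grows mirrors the mean-square linear stability of SGD itself.
```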