The Implicit Regularization of Dynamical Stability in Stochastic
Gradient Descent
- URL: http://arxiv.org/abs/2305.17490v2
- Date: Thu, 1 Jun 2023 07:54:48 GMT
- Title: The Implicit Regularization of Dynamical Stability in Stochastic
Gradient Descent
- Authors: Lei Wu, Weijie J. Su
- Abstract summary: We study the implicit regularization of stochastic gradient descent (SGD) through the lens of dynamical stability.
We analyze the generalization properties of two-layer ReLU networks and diagonal linear networks.
- Score: 32.25490196411385
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we study the implicit regularization of stochastic gradient
descent (SGD) through the lens of dynamical stability (Wu et al., 2018).
We start by revising existing stability analyses of SGD, showing how the
Frobenius norm and trace of Hessian relate to different notions of stability.
Notably, if a global minimum is linearly stable for SGD, then the trace of
Hessian must be less than or equal to $2/\eta$, where $\eta$ denotes the
learning rate. By contrast, for gradient descent (GD), the stability imposes a
similar constraint but only on the largest eigenvalue of Hessian. We then turn
to analyze the generalization properties of these stable minima, focusing
specifically on two-layer ReLU networks and diagonal linear networks. Notably,
we establish the equivalence between these metrics of sharpness and
certain parameter norms for the two models, which allows us to show that the
stable minima of SGD provably generalize well. By contrast, the
stability-induced regularization of GD is provably too weak to ensure
satisfactory generalization. This discrepancy provides an explanation of why
SGD often generalizes better than GD. Note that the learning rate (LR) plays a
pivotal role in the strength of stability-induced regularization. As the LR
increases, the regularization effect becomes more pronounced, elucidating why
SGD with a larger LR consistently demonstrates superior generalization
capabilities. Additionally, numerical experiments are provided to support our
theoretical findings.
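As a rough illustration of the two stability conditions contrasted in the abstract (trace of the Hessian for SGD versus largest eigenvalue for GD), the following sketch checks both thresholds on a toy quadratic loss; the Hessian and learning rate here are illustrative assumptions, not from the paper:

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T H w with a fixed Hessian H.
# The paper's stability conditions at a global minimum:
#   GD:  lambda_max(H) <= 2/eta
#   SGD: trace(H)      <= 2/eta  (a stronger requirement, since
#        trace(H) >= lambda_max(H) for a PSD Hessian)
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A.T @ A  # symmetric positive semi-definite Hessian (assumption)

eta = 0.05  # learning rate (assumption)
lam_max = np.linalg.eigvalsh(H)[-1]
tr = np.trace(H)

gd_stable = lam_max <= 2.0 / eta
sgd_stable = tr <= 2.0 / eta  # necessary condition for SGD linear stability

print(f"lambda_max = {lam_max:.3f}, trace = {tr:.3f}, 2/eta = {2 / eta:.1f}")
print(f"GD condition: {gd_stable}, SGD trace condition: {sgd_stable}")
```

Because the trace upper-bounds the largest eigenvalue, any minimum that passes the SGD trace condition automatically passes the GD condition, but not vice versa; this asymmetry is the source of the stronger regularization attributed to SGD.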
Related papers
- A Precise Characterization of SGD Stability Using Loss Surface Geometry [8.942671556572073]
Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical successes but relatively limited theoretical understanding.
Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates.
arXiv Detail & Related papers (2024-01-22T19:46:30Z)
- Risk Bounds of Accelerated SGD for Overparameterized Linear Regression [75.27846230182885]
Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning.
Existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization.
arXiv Detail & Related papers (2023-11-23T23:02:10Z)
- Exact Mean Square Linear Stability Analysis for SGD [28.65663421598186]
We provide an explicit condition on the step size that is both necessary and sufficient for the linear stability of stochastic gradient descent (SGD).
We show that SGD's stability threshold is equivalent to that of a mixture process which, at each iteration, takes a full-batch gradient step with probability $1-p$ and a single-sample gradient step with probability $p$.
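The mixture process described in that summary can be sketched on a toy least-squares problem; the data, step size, and mixing probability below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Mixture process: at each step, take a full-batch gradient step with
# probability 1-p, or a single-sample gradient step with probability p.
# Toy interpolating least-squares data; all hyperparameters are
# assumptions for illustration.
rng = np.random.default_rng(1)
n, d = 32, 4
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)  # realizable targets, so a zero-loss minimum exists

eta, p = 0.05, 0.3
w = np.zeros(d)
for _ in range(500):
    if rng.random() < p:
        i = rng.integers(n)  # single-sample gradient step
        grad = (X[i] @ w - y[i]) * X[i]
    else:
        grad = X.T @ (X @ w - y) / n  # full-batch gradient step
    w -= eta * grad

loss = 0.5 * np.mean((X @ w - y) ** 2)
print(f"final loss: {loss:.6f}")
```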
arXiv Detail & Related papers (2023-06-13T15:29:23Z)
- Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares [12.2950446921662]
Recent studies have shown that heavy tails can emerge in optimization and that the heaviness of the tails has links to the generalization error.
We establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD) through the lens of algorithmic stability.
We support our theory with synthetic and real neural network experiments.
arXiv Detail & Related papers (2022-06-02T19:59:48Z)
- Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z)
- Stability Based Generalization Bounds for Exponential Family Langevin Dynamics [21.26469220896393]
We study generalization bounds for noisy mini-batch iterative algorithms based on the notion of stability.
We introduce Exponential Family Langevin Dynamics (EFLD), a substantial generalization of SGLD that allows exponential-family noise to be used with stochastic gradient descent.
We also consider an important special case of EFLD: noisy sign-SGD, which extends sign-SGD using Bernoulli noise over $\{-1,+1\}$.
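As a rough sketch of what a noisy sign-SGD update might look like: the descent direction is the coordinate-wise sign of the gradient, with each sign independently flipped by Bernoulli noise over $\{-1,+1\}$. The flip probability, toy objective, and hyperparameters below are assumptions for illustration, not the paper's exact EFLD formulation:

```python
import numpy as np

# Noisy sign-SGD sketch on the toy objective 0.5 * ||w||^2, whose
# gradient is simply w. Each coordinate's sign is flipped with a small
# probability (an illustrative assumption), giving Bernoulli {-1,+1}
# noise on top of the sign-SGD direction.
rng = np.random.default_rng(2)
d = 8
w = rng.standard_normal(d) * 5.0  # start away from the minimum at 0

eta, flip_prob = 0.05, 0.1
for _ in range(2000):
    grad = w  # gradient of 0.5 * ||w||^2
    flips = np.where(rng.random(d) < flip_prob, -1.0, 1.0)
    w -= eta * np.sign(grad) * flips  # noisy sign step

print(f"final distance to minimum: {np.linalg.norm(w):.3f}")
```

With a constant step size, the iterates do not converge exactly but oscillate in a band of width on the order of $\eta$ around the minimum, which is the usual behavior of sign-based updates.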
arXiv Detail & Related papers (2022-01-09T18:15:22Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first-ever-known fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first-ever-known stability and generalization bounds for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z)
- Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.