When does SGD favor flat minima? A quantitative characterization via
linear stability
- URL: http://arxiv.org/abs/2207.02628v1
- Date: Wed, 6 Jul 2022 12:40:09 GMT
- Title: When does SGD favor flat minima? A quantitative characterization via
linear stability
- Authors: Lei Wu, Mingze Wang, Weijie Su
- Abstract summary: Stochastic gradient descent (SGD) favors flat minima. The geometry-aware property of SGD noise that drives this behavior provably holds for linear networks and random feature models (RFMs).
- Score: 7.252584656056866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The observation that stochastic gradient descent (SGD) favors flat minima has
played a fundamental role in understanding implicit regularization of SGD and
guiding the tuning of hyperparameters. In this paper, we provide a quantitative
explanation of this striking phenomenon by relating the particular noise
structure of SGD to its \emph{linear stability} (Wu et al., 2018).
Specifically, we consider training over-parameterized models with square loss.
We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then
it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where
$\|H(\theta^*)\|_F, B,\eta$ denote the Frobenius norm of Hessian at $\theta^*$,
batch size, and learning rate, respectively. Otherwise, SGD will escape from
that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD,
the flatness -- as measured by the Frobenius norm of the Hessian -- is bounded
independently of the model size and sample size. The key to obtaining these
results is exploiting the particular geometry awareness of SGD noise: 1) the
noise magnitude is proportional to the loss value; 2) the noise directions
concentrate in the sharp directions of the local landscape. This property of SGD
noise provably holds for linear networks and random feature models (RFMs) and
is empirically verified for nonlinear networks. Moreover, the validity and
practical relevance of our theoretical findings are justified by extensive
numerical experiments.
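To make the bound concrete, here is a minimal numerical sketch (not code from the paper; the dimensions, data scales, learning rate, and batch size are illustrative assumptions). It perturbs an interpolating global minimum of an over-parameterized linear least-squares problem, where the Hessian $H = X^\top X / n$ is available in closed form, and runs mini-batch SGD: when the measured $\|H\|_F$ stays below $\sqrt{B}/\eta$ the loss decays back toward zero, and when the bound is violated the iterates escape rapidly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50          # n samples, d parameters (over-parameterized: d > n)
eta, B = 0.5, 1        # learning rate and batch size (illustrative choices)
steps = 200

def run_sgd(scale):
    """Perturb an interpolating minimum and run mini-batch SGD on the square loss."""
    X = scale * rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)             # teacher weights: an interpolating global minimum
    y = X @ w_star
    H = X.T @ X / n                             # Hessian of the quadratic loss (constant in w)
    hess_fro = np.linalg.norm(H, 'fro')

    w = w_star + 1e-3 * rng.standard_normal(d)  # small perturbation off the minimum
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    for _ in range(steps):
        idx = rng.choice(n, size=B, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / B
        w = w - eta * grad
        loss = 0.5 * np.mean((X @ w - y) ** 2)
        if not np.isfinite(loss) or loss > 1e10:  # stop once the iterates have clearly escaped
            break
    return hess_fro, loss

bound = np.sqrt(B) / eta                         # the linear-stability bound sqrt(B)/eta
for name, scale in [("flat", 0.1), ("sharp", 0.6)]:
    fro, final_loss = run_sgd(scale)
    verdict = "satisfies" if fro <= bound else "violates"
    print(f"{name:5s}: ||H||_F = {fro:7.2f} ({verdict} sqrt(B)/eta = {bound:.2f}), "
          f"final loss = {final_loss:.3e}")
```

The "sharp" scale is chosen so that, for these dimensions, full-batch gradient descent would typically remain linearly stable ($\eta\,\lambda_{\max}(H) < 2$), so the observed escape reflects the noise-driven mechanism described in the abstract rather than plain GD instability.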
Related papers
- Exact Mean Square Linear Stability Analysis for SGD [28.65663421598186]
We provide an explicit condition on the step size that is both necessary and sufficient for the linear stability of stochastic gradient descent (SGD).
We show that SGD's stability threshold is equivalent to that of a mixture process which, in each iteration, takes a full-batch gradient step w.p. $1-p$ and a single-sample gradient step w.p. $p$ (a toy sketch of this process is given after the related papers list).
arXiv Detail & Related papers (2023-06-13T15:29:23Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise and (ii) problem-dependent parameters.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near-optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss (a toy illustration of this effect is given after the related papers list).
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
arXiv Detail & Related papers (2021-06-11T17:59:07Z) - The Sobolev Regularization Effect of Stochastic Gradient Descent [8.193914488276468]
We show that flat minima regularize the gradient of the model function, which explains the good performance of flat minima.
We also consider high-order moments of the gradient noise, and show via a linear analysis of stochastic gradient descent (SGD) around global minima that SGD tends to impose constraints on these moments.
arXiv Detail & Related papers (2021-05-27T21:49:21Z) - Noisy Truncated SGD: Optimization and Generalization [27.33458360279836]
Recent empirical work on SGD has shown that most gradient components over epochs are quite small.
Inspired by such a study, we rigorously study properties of Noisy Truncated SGD (NT-SGD).
We prove that NT-SGD can provably escape from saddle points and requires less noise compared to previous related work.
arXiv Detail & Related papers (2021-02-26T22:39:41Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods for training deep neural networks.
We show that the covariance of the SGD noise in the local region around a local minimum is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z) - Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training over-parameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z) - A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient
Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)
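The entry "Exact Mean Square Linear Stability Analysis for SGD" above describes SGD's stability threshold via a mixture process that takes a full-batch gradient step w.p. $1-p$ and a single-sample gradient step w.p. $p$. The sketch below is only a toy illustration of that process: the quadratic objective, the choice $p = 0.5$, and all constants are assumptions, and it does not reproduce the paper's exact threshold or its batch-size-dependent choice of $p$. It simply implements the mixture step and probes mean-square stability by averaging $\|w_T\|^2 / \|w_0\|^2$ over trials for a few step sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 10                                 # toy quadratic: loss(w) = mean((A @ w)**2) / 2
A = rng.standard_normal((n, d))                # global minimum at w = 0
p = 0.5                                        # mixing probability (treated as a free parameter here)

def mixture_step(w, eta):
    """One step of the mixture process: single-sample gradient w.p. p, full-batch gradient w.p. 1-p."""
    if rng.random() < p:
        i = rng.integers(n)
        grad = A[i] * (A[i] @ w)               # single-sample gradient
    else:
        grad = A.T @ (A @ w) / n               # full-batch gradient
    return w - eta * grad

def mean_square_ratio(eta, steps=2000, trials=50):
    """Average ||w_T||^2 / ||w_0||^2 over trials: values >> 1 indicate mean-square instability."""
    ratios = []
    for _ in range(trials):
        w = 1e-3 * rng.standard_normal(d)      # small perturbation around the minimum
        start = w @ w
        for _ in range(steps):
            w = mixture_step(w, eta)
        ratios.append((w @ w) / start)
    return float(np.mean(ratios))

for eta in (0.01, 0.05, 0.2):
    print(f"eta = {eta:4.2f}: mean ||w_T||^2 / ||w_0||^2 = {mean_square_ratio(eta):.3e}")
```

For small step sizes the ratio shrinks (stable), while past some threshold it grows by many orders of magnitude (unstable); characterizing that threshold exactly is what the paper does.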
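Similarly, the entry "Label Noise SGD Provably Prefers Flat Global Minimizers" can be illustrated on a toy two-parameter model $f(u, v) = uv$ fit to a single clean target $y = 1$: every point with $uv = 1$ is a global minimum, and the Hessian trace there is $u^2 + v^2$, which is smallest at the balanced, flattest solutions $|u| = |v| = 1$. The sketch below is an assumption-laden toy, not the paper's construction; the learning rate, noise level, and step count are guesses chosen so the drift is visible.

```python
import numpy as np

rng = np.random.default_rng(2)
eta, sigma, steps = 1e-2, 0.5, 2_000_000       # illustrative guesses, not from the paper
u, v = 4.0, 0.25                                # start at a sharp global minimum (u * v = 1)

noise = sigma * rng.standard_normal(steps)      # fresh label noise at every step
for t in range(steps):
    r = u * v - (1.0 + noise[t])                # residual against the noisy label
    u, v = u - eta * v * r, v - eta * u * r     # simultaneous SGD update on the square loss
    if t % 400_000 == 0 or t == steps - 1:
        print(f"step {t:>9d}: u = {u:6.3f}, v = {v:6.3f}, "
              f"u*v = {u * v:6.3f}, tr(H) = u^2 + v^2 = {u * u + v * v:7.3f}")
```

The printout should show $uv$ hovering near 1 while $u^2 + v^2$ (the sharpness proxy) steadily decreases, i.e., label-noise SGD sliding along the zero-loss manifold toward a flatter minimum.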
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.