Label Noise SGD Provably Prefers Flat Global Minimizers
- URL: http://arxiv.org/abs/2106.06530v1
- Date: Fri, 11 Jun 2021 17:59:07 GMT
- Title: Label Noise SGD Provably Prefers Flat Global Minimizers
- Authors: Alex Damian, Tengyu Ma, Jason Lee
- Abstract summary: In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
- Score: 48.883469271546076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In overparametrized models, the noise in stochastic gradient descent (SGD)
implicitly regularizes the optimization trajectory and determines which local
minimum SGD converges to. Motivated by empirical studies that demonstrate that
training with noisy labels improves generalization, we study the implicit
regularization effect of SGD with label noise. We show that SGD with label
noise converges to a stationary point of a regularized loss $L(\theta) +\lambda
R(\theta)$, where $L(\theta)$ is the training loss, $\lambda$ is an effective
regularization parameter depending on the step size, strength of the label
noise, and the batch size, and $R(\theta)$ is an explicit regularizer that
penalizes sharp minimizers. Our analysis uncovers an additional regularization
effect of large learning rates beyond the linear scaling rule that penalizes
large eigenvalues of the Hessian more than small ones. We also prove extensions
to classification with general loss functions, SGD with momentum, and SGD with
general noise covariance, significantly strengthening the prior work of Blanc
et al. to global convergence and large learning rates and of HaoChen et al. to
general models.
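To make the setup concrete, here is a minimal sketch of SGD with label noise: fresh Gaussian noise is added to the labels at every step, which is the mechanism the abstract credits with inducing the implicit regularizer $\lambda R(\theta)$. The toy "diagonal network" model, constants, and variable names below are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparametrized model: a "diagonal network" f(x) = x @ (u * v),
# whose global minimizers differ in sharpness (unlike plain linear
# regression, where every interpolating minimum has the same Hessian).
n, d = 20, 50
X = rng.normal(size=(n, d))
y = 0.1 * (X @ rng.normal(size=d))

u = 0.5 * rng.normal(size=d)
v = 0.5 * rng.normal(size=d)
eta, sigma, batch = 0.01, 0.3, 4   # step size, label-noise strength, batch size

for step in range(50_000):
    idx = rng.choice(n, size=batch, replace=False)
    # Fresh Gaussian label noise at every step: per the abstract, this is
    # what produces the implicit regularizer lambda * R(theta), with lambda
    # depending on the step size, noise strength, and batch size.
    y_noisy = y[idx] + sigma * rng.normal(size=batch)
    r = X[idx] @ (u * v) - y_noisy            # mini-batch residuals
    g = X[idx].T @ r / batch                  # shared gradient factor
    u, v = u - eta * g * v, v - eta * g * u   # simultaneous update of u and v

print("clean training loss:", 0.5 * np.mean((X @ (u * v) - y) ** 2))
```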
Related papers
- On the Trajectories of SGD Without Replacement [0.0]
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD).
We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks.
arXiv Detail & Related papers (2023-12-26T18:06:48Z)
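To make the distinction in the entry above concrete, here is a hedged sketch of the two sampling schemes; the names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch = 12, 4

# Without replacement (shuffled epochs): each example is visited exactly once
# per epoch; this is the variant the paper analyzes and the default in most
# deep learning frameworks.
perm = rng.permutation(n)
epoch_batches = [perm[i:i + batch] for i in range(0, n, batch)]

# With replacement: every mini-batch is an i.i.d. draw, so an example can
# repeat within an epoch; most classical SGD analyses assume this variant.
iid_batches = [rng.choice(n, size=batch, replace=True) for _ in range(n // batch)]

print("without replacement:", [list(b) for b in epoch_batches])
print("with replacement:   ", [list(b) for b in iid_batches])
```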
- Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of stochastic gradient descent.
We find that this implicit regularizer favors convergence points that stabilize model outputs against perturbations of the parameters.
Our work does not assume that SGD behaves as an Ornstein-Uhlenbeck-like process, and it achieves a more general result with the convergence of the approximation proven.
arXiv Detail & Related papers (2023-04-01T14:09:07Z)
- Why is parameter averaging beneficial in SGD? An objective smoothing perspective [13.863368438870562]
Stochastic gradient descent (SGD) and its implicit bias are often characterized in terms of the sharpness of the minima.
We study the commonly used averaged SGD algorithm, which Izmailov et al. empirically observed to improve generalization.
We prove that averaged SGD can efficiently optimize the smoothed objective, which avoids sharp local minima.
arXiv Detail & Related papers (2023-02-18T16:29:06Z)
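A minimal sketch of the averaged SGD iterate from the entry above, assuming a noisy least-squares model: alongside plain SGD, a running mean of the iterates is maintained (in the spirit of Izmailov et al.'s weight averaging) and evaluated at the end. The smoothing argument itself is the paper's contribution; this only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
theta_avg = np.zeros(d)
eta, batch = 0.05, 5

for t in range(1, 5001):
    idx = rng.choice(n, size=batch, replace=False)
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch
    theta -= eta * grad
    theta_avg += (theta - theta_avg) / t   # incremental mean of all iterates

loss = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
print("last-iterate loss:", loss(theta))
print("averaged loss:    ", loss(theta_avg))
```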
- When does SGD favor flat minima? A quantitative characterization via linear stability [7.252584656056866]
Stochastic gradient descent (SGD) favors flat minima.
This property of SGD noise provably holds for linear networks and random feature models (RFMs).
arXiv Detail & Related papers (2022-07-06T12:40:09Z)
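As background for the linear-stability viewpoint in the entry above, here is a hedged sketch that estimates sharpness (the top Hessian eigenvalue) at a minimizer via power iteration on Hessian-vector products, then checks the classical deterministic condition $\eta \lambda_{\max} \le 2$. The paper's SGD condition is stricter and noise-dependent; this only illustrates the baseline quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 40, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]   # a global minimizer

def grad(theta):
    return X.T @ (X @ theta - y) / n

def hvp(v, eps=1e-5):
    # Finite-difference Hessian-vector product at theta_star.
    return (grad(theta_star + eps * v) - grad(theta_star - eps * v)) / (2 * eps)

v = rng.normal(size=d)
for _ in range(100):                     # power iteration
    v = hvp(v)
    v /= np.linalg.norm(v)
sharpness = v @ hvp(v)                   # top eigenvalue of the Hessian

eta = 0.1
print("sharpness (top Hessian eigenvalue):", sharpness)
print("linearly stable for GD:", eta * sharpness <= 2.0)
```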
- Optimal Online Generalized Linear Regression with Stochastic Noise and Its Application to Heteroscedastic Bandits [88.6139446295537]
We study the problem of online generalized linear regression, where the label is generated from a generalized linear model with possibly unbounded additive noise.
We provide a sharp analysis of the classical follow-the-regularized-leader (FTRL) algorithm to cope with the label noise.
We propose an algorithm based on FTRL to achieve the first variance-aware regret bound.
arXiv Detail & Related papers (2022-02-28T08:25:26Z)
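For the FTRL algorithm named in the entry above, here is a hedged sketch of its simplest instance: online linear regression with squared loss and an $\ell_2$ regularizer, where each FTRL iterate has a closed ridge-regression form. The paper's generalized-linear, variance-aware analysis goes well beyond this special case.

```python
import numpy as np

rng = np.random.default_rng(0)

# FTRL with squared loss: theta_t minimizes (lam/2)||theta||^2 plus the sum
# of all past squared losses, i.e. solves (lam*I + sum x x^T) theta = sum y x.
d, lam = 5, 1.0
A = lam * np.eye(d)          # lam*I + sum_s x_s x_s^T
b = np.zeros(d)              # sum_s y_s x_s
w_true = rng.normal(size=d)

regret = 0.0
for t in range(1000):
    theta = np.linalg.solve(A, b)          # current FTRL iterate
    x = rng.normal(size=d)
    y = x @ w_true + 0.1 * rng.normal()    # label with additive noise
    regret += 0.5 * (x @ theta - y) ** 2 - 0.5 * (x @ w_true - y) ** 2
    A += np.outer(x, x)                    # fold the new loss into FTRL
    b += y * x

print("cumulative regret vs. w_true:", regret)
```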
- Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves its generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z)
- SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Improved generalization by noise enhancement [5.33024001730262]
Noise in stochastic gradient descent (SGD) is closely related to generalization.
We propose a method, "noise enhancement", that amplifies the SGD noise.
It turns out that large-batch training with noise enhancement even shows better generalization than small-batch training.
arXiv Detail & Related papers (2020-09-28T06:29:23Z)
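The entry above does not spell out the scheme, so the following is only one plausible reconstruction of "noise enhancement": amplify the deviation of a mini-batch gradient from the full-batch gradient by a factor alpha > 1, which scales the SGD noise without changing its mean. Treat every detail here as an assumption rather than the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 64, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
eta, batch, alpha = 0.05, 32, 4.0   # large batch, amplified noise

def grad(idx):
    return X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)

for _ in range(2000):
    g_full = grad(np.arange(n))
    g_batch = grad(rng.choice(n, size=batch, replace=False))
    # Mean gradient plus alpha times the mini-batch noise component.
    theta -= eta * (g_full + alpha * (g_batch - g_full))

print("loss:", 0.5 * np.mean((X @ theta - y) ** 2))
```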
- Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
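A hedged sketch of the two noise shapes the entry above contrasts, in a toy overparametrized least-squares problem: label perturbation, whose gradient-noise covariance is shaped by the data (proportional to the Hessian $X^\top X / n$ here), versus spherical Gaussian noise added directly to the gradient. The setting and constants are illustrative, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 30, 60                       # overparametrized: many global minima
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
eta, sigma, steps = 0.01, 0.3, 20000

def run(parameter_dependent):
    theta = np.zeros(d)
    for _ in range(steps):
        if parameter_dependent:
            # Label perturbation: the noise enters through the data, so its
            # covariance is anisotropic and proportional to the Hessian.
            y_t = y + sigma * rng.normal(size=n)
            g = X.T @ (X @ theta - y_t) / n
        else:
            # Spherical Gaussian noise added directly to the gradient.
            g = X.T @ (X @ theta - y) / n + sigma * rng.normal(size=d)
        theta -= eta * g
    return theta

for name, flag in [("label noise       ", True), ("spherical Gaussian", False)]:
    th = run(flag)
    print(name, "loss:", 0.5 * np.mean((X @ th - y) ** 2))
```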
This list is automatically generated from the titles and abstracts of the papers on this site.