Label Noise SGD Provably Prefers Flat Global Minimizers
- URL: http://arxiv.org/abs/2106.06530v1
- Date: Fri, 11 Jun 2021 17:59:07 GMT
- Title: Label Noise SGD Provably Prefers Flat Global Minimizers
- Authors: Alex Damian, Tengyu Ma, Jason Lee
- Abstract summary: In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
- Score: 48.883469271546076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In overparametrized models, the noise in stochastic gradient descent (SGD)
implicitly regularizes the optimization trajectory and determines which local
minimum SGD converges to. Motivated by empirical studies that demonstrate that
training with noisy labels improves generalization, we study the implicit
regularization effect of SGD with label noise. We show that SGD with label
noise converges to a stationary point of a regularized loss $L(\theta) +\lambda
R(\theta)$, where $L(\theta)$ is the training loss, $\lambda$ is an effective
regularization parameter depending on the step size, strength of the label
noise, and the batch size, and $R(\theta)$ is an explicit regularizer that
penalizes sharp minimizers. Our analysis uncovers an additional regularization
effect of large learning rates beyond the linear scaling rule that penalizes
large eigenvalues of the Hessian more than small ones. We also prove extensions
to classification with general loss functions, SGD with momentum, and SGD with
general noise covariance, significantly strengthening the prior work of Blanc
et al. to global convergence and large learning rates and of HaoChen et al. to
general models.
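To make the setup concrete, here is a minimal sketch of SGD with label noise: fresh Gaussian noise is added to the labels at every step, which is the mechanism the abstract credits with inducing the implicit regularizer $\lambda R(\theta)$. The toy "diagonal network" model, constants, and variable names below are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparametrized model: a "diagonal network" f(x) = x @ (u * v),
# whose global minimizers differ in sharpness (unlike plain linear
# regression, where every interpolating minimum has the same Hessian).
n, d = 20, 50
X = rng.normal(size=(n, d))
y = 0.1 * (X @ rng.normal(size=d))

u = 0.5 * rng.normal(size=d)
v = 0.5 * rng.normal(size=d)
eta, sigma, batch = 0.01, 0.3, 4   # step size, label-noise strength, batch size

for step in range(50_000):
    idx = rng.choice(n, size=batch, replace=False)
    # Fresh Gaussian label noise at every step: per the abstract, this is
    # what produces the implicit regularizer lambda * R(theta), with lambda
    # depending on the step size, noise strength, and batch size.
    y_noisy = y[idx] + sigma * rng.normal(size=batch)
    r = X[idx] @ (u * v) - y_noisy            # mini-batch residuals
    g = X[idx].T @ r / batch                  # shared gradient factor
    u, v = u - eta * g * v, v - eta * g * u   # simultaneous update of u and v

print("clean training loss:", 0.5 * np.mean((X @ (u * v) - y) ** 2))
```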
Related papers
- On the Trajectories of SGD Without Replacement [0.0]
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD).
We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks.
arXiv Detail & Related papers (2023-12-26T18:06:48Z)
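To make the distinction in the entry above concrete, here is a hedged sketch of the two sampling schemes; the names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch = 12, 4

# Without replacement (shuffled epochs): each example is visited exactly once
# per epoch; this is the variant the paper analyzes and the default in most
# deep learning frameworks.
perm = rng.permutation(n)
epoch_batches = [perm[i:i + batch] for i in range(0, n, batch)]

# With replacement: every mini-batch is an i.i.d. draw, so an example can
# repeat within an epoch; most classical SGD analyses assume this variant.
iid_batches = [rng.choice(n, size=batch, replace=True) for _ in range(n // batch)]

print("without replacement:", [list(b) for b in epoch_batches])
print("with replacement:   ", [list(b) for b in iid_batches])
```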
- Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of stochastic gradient descent.
We find that this implicit regularizer favors convergence points that stabilize model outputs against perturbations of the parameters.
Our work does not assume that SGD behaves as an Ornstein-Uhlenbeck-like process, and it achieves a more general result with the convergence of the approximation proven.
arXiv Detail & Related papers (2023-04-01T14:09:07Z)
- Why is parameter averaging beneficial in SGD? An objective smoothing perspective [13.863368438870562]
Stochastic gradient descent (SGD) and its implicit bias are often characterized in terms of the sharpness of the minima.
We study the commonly used averaged SGD algorithm, which Izmailov et al. empirically observed to improve generalization.
We prove that averaged SGD can efficiently optimize the smoothed objective, which avoids sharp local minima.
arXiv Detail & Related papers (2023-02-18T16:29:06Z)
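A minimal sketch of the averaged SGD iterate from the entry above, assuming a noisy least-squares model: alongside plain SGD, a running mean of the iterates is maintained (in the spirit of Izmailov et al.'s weight averaging) and evaluated at the end. The smoothing argument itself is the paper's contribution; this only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
theta_avg = np.zeros(d)
eta, batch = 0.05, 5

for t in range(1, 5001):
    idx = rng.choice(n, size=batch, replace=False)
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch
    theta -= eta * grad
    theta_avg += (theta - theta_avg) / t   # incremental mean of all iterates

loss = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
print("last-iterate loss:", loss(theta))
print("averaged loss:    ", loss(theta_avg))
```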
- When does SGD favor flat minima? A quantitative characterization via linear stability [7.252584656056866]
Stochastic gradient descent (SGD) favors flat minima.
This property of SGD noise provably holds for linear networks and random feature models (RFMs).
arXiv Detail & Related papers (2022-07-06T12:40:09Z)
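As background for the linear-stability viewpoint in the entry above, here is a hedged sketch that estimates sharpness (the top Hessian eigenvalue) at a minimizer via power iteration on Hessian-vector products, then checks the classical deterministic condition $\eta \lambda_{\max} \le 2$. The paper's SGD condition is stricter and noise-dependent; this only illustrates the baseline quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 40, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]   # a global minimizer

def grad(theta):
    return X.T @ (X @ theta - y) / n

def hvp(v, eps=1e-5):
    # Finite-difference Hessian-vector product at theta_star.
    return (grad(theta_star + eps * v) - grad(theta_star - eps * v)) / (2 * eps)

v = rng.normal(size=d)
for _ in range(100):                     # power iteration
    v = hvp(v)
    v /= np.linalg.norm(v)
sharpness = v @ hvp(v)                   # top eigenvalue of the Hessian

eta = 0.1
print("sharpness (top Hessian eigenvalue):", sharpness)
print("linearly stable for GD:", eta * sharpness <= 2.0)
```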
- Optimal Online Generalized Linear Regression with Stochastic Noise and Its Application to Heteroscedastic Bandits [88.6139446295537]
We study the problem of online generalized linear regression, where the label is generated from a generalized linear model with possibly unbounded additive noise.
We provide a sharp analysis of the classical follow-the-regularized-leader (FTRL) algorithm to cope with the label noise.
We propose an algorithm based on FTRL to achieve the first variance-aware regret bound.
arXiv Detail & Related papers (2022-02-28T08:25:26Z)
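For the FTRL algorithm named in the entry above, here is a hedged sketch of its simplest instance: online linear regression with squared loss and an $\ell_2$ regularizer, where each FTRL iterate has a closed ridge-regression form. The paper's generalized-linear, variance-aware analysis goes well beyond this special case.

```python
import numpy as np

rng = np.random.default_rng(0)

# FTRL with squared loss: theta_t minimizes (lam/2)||theta||^2 plus the sum
# of all past squared losses, i.e. solves (lam*I + sum x x^T) theta = sum y x.
d, lam = 5, 1.0
A = lam * np.eye(d)          # lam*I + sum_s x_s x_s^T
b = np.zeros(d)              # sum_s y_s x_s
w_true = rng.normal(size=d)

regret = 0.0
for t in range(1000):
    theta = np.linalg.solve(A, b)          # current FTRL iterate
    x = rng.normal(size=d)
    y = x @ w_true + 0.1 * rng.normal()    # label with additive noise
    regret += 0.5 * (x @ theta - y) ** 2 - 0.5 * (x @ w_true - y) ** 2
    A += np.outer(x, x)                    # fold the new loss into FTRL
    b += y * x

print("cumulative regret vs. w_true:", regret)
```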
- Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves its generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z)
- SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Improved generalization by noise enhancement [5.33024001730262]
Noise in stochastic gradient descent (SGD) is closely related to generalization.
We propose a method, "noise enhancement", that amplifies the SGD noise.
It turns out that large-batch training with noise enhancement even shows better generalization than small-batch training.
arXiv Detail & Related papers (2020-09-28T06:29:23Z)
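The entry above does not spell out the scheme, so the following is only one plausible reconstruction of "noise enhancement": amplify the deviation of a mini-batch gradient from the full-batch gradient by a factor alpha > 1, which scales the SGD noise without changing its mean. Treat every detail here as an assumption rather than the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 64, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
eta, batch, alpha = 0.05, 32, 4.0   # large batch, amplified noise

def grad(idx):
    return X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)

for _ in range(2000):
    g_full = grad(np.arange(n))
    g_batch = grad(rng.choice(n, size=batch, replace=False))
    # Mean gradient plus alpha times the mini-batch noise component.
    theta -= eta * (g_full + alpha * (g_batch - g_full))

print("loss:", 0.5 * np.mean((X @ theta - y) ** 2))
```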
- Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
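A hedged sketch of the two noise shapes the entry above contrasts, in a toy overparametrized least-squares problem: label perturbation, whose gradient-noise covariance is shaped by the data (proportional to the Hessian $X^\top X / n$ here), versus spherical Gaussian noise added directly to the gradient. The setting and constants are illustrative, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 30, 60                       # overparametrized: many global minima
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
eta, sigma, steps = 0.01, 0.3, 20000

def run(parameter_dependent):
    theta = np.zeros(d)
    for _ in range(steps):
        if parameter_dependent:
            # Label perturbation: the noise enters through the data, so its
            # covariance is anisotropic and proportional to the Hessian.
            y_t = y + sigma * rng.normal(size=n)
            g = X.T @ (X @ theta - y_t) / n
        else:
            # Spherical Gaussian noise added directly to the gradient.
            g = X.T @ (X @ theta - y) / n + sigma * rng.normal(size=d)
        theta -= eta * g
    return theta

for name, flag in [("label noise       ", True), ("spherical Gaussian", False)]:
    th = run(flag)
    print(name, "loss:", 0.5 * np.mean((X @ th - y) ** 2))
```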
This list is automatically generated from the titles and abstracts of the papers on this site.