Noisy Truncated SGD: Optimization and Generalization
- URL: http://arxiv.org/abs/2103.00075v1
- Date: Fri, 26 Feb 2021 22:39:41 GMT
- Title: Noisy Truncated SGD: Optimization and Generalization
- Authors: Yingxue Zhou, Xinyan Li, Arindam Banerjee
- Abstract summary: Recent empirical work on SGD has shown that most gradient components over epochs are quite small.
Inspired by such observations, we rigorously study properties of noisy truncated SGD (NT-SGD).
We prove that NT-SGD can provably escape from saddle points and requires less noise compared to previous related work.
- Score: 27.33458360279836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent empirical work on SGD applied to over-parameterized deep learning has
shown that most gradient components over epochs are quite small. Inspired by
such observations, we rigorously study properties of noisy truncated SGD
(NT-SGD), a noisy gradient descent algorithm that truncates (hard thresholds)
the majority of small gradient components to zeros and then adds Gaussian noise
to all components. Considering non-convex smooth problems, we first establish
the rate of convergence of NT-SGD in terms of empirical gradient norms, and
show the rate to be of the same order as the vanilla SGD. Further, we prove
that NT-SGD can provably escape from saddle points and requires less noise
compared to previous related work. We also establish a generalization bound for
NT-SGD using uniform stability based on discretized generalized Langevin
dynamics. Our experiments on MNIST (VGG-5) and CIFAR-10 (ResNet-18) demonstrate
that NT-SGD matches the speed and accuracy of vanilla SGD, and can successfully
escape sharp minima while having better theoretical properties.
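The abstract describes a concrete update rule: hard-threshold the small components of the stochastic gradient to zero, then add Gaussian noise to every component before taking a step. Below is a minimal sketch of one such update, assuming a keep-the-largest-magnitude-fraction thresholding rule; the function name, the truncation fraction keep_frac, the noise scale noise_std, and the step size are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def nt_sgd_step(w, stochastic_grad, lr=0.1, keep_frac=0.1, noise_std=1e-3, rng=None):
    """One noisy truncated SGD (NT-SGD) update, sketched from the abstract:
    hard-threshold the stochastic gradient so only the largest-magnitude
    components survive, add Gaussian noise to all components, then step.
    lr, keep_frac, and noise_std are illustrative placeholders."""
    rng = rng if rng is not None else np.random.default_rng()
    g = stochastic_grad(w)                                 # minibatch gradient, shape (d,)
    k = max(1, int(keep_frac * g.size))                    # number of components to keep
    thresh = np.partition(np.abs(g), -k)[-k]               # magnitude of the k-th largest entry
    g_trunc = np.where(np.abs(g) >= thresh, g, 0.0)        # zero out the small components
    g_noisy = g_trunc + noise_std * rng.standard_normal(g.shape)  # Gaussian noise on all components
    return w - lr * g_noisy

# Toy usage: a noisy least-squares gradient stands in for a minibatch gradient.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
grad = lambda w: A.T @ (A @ w - b) / len(b) + 0.01 * rng.standard_normal(20)
w = np.zeros(20)
for _ in range(200):
    w = nt_sgd_step(w, grad, rng=rng)
```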
Related papers
- Why is parameter averaging beneficial in SGD? An objective smoothing perspective [13.863368438870562]
Stochastic gradient descent (SGD) and its implicit bias are often characterized in terms of the sharpness of the minima.
We study the commonly used averaged SGD algorithm, which Izmailov et al. empirically observed to be beneficial.
We prove that averaged SGD can efficiently optimize a smoothed objective that avoids sharp local minima.
arXiv Detail & Related papers (2023-02-18T16:29:06Z)
- From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models.
An overarching contribution is to provide general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges.
We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z)
- When does SGD favor flat minima? A quantitative characterization via linear stability [7.252584656056866]
Stochastic gradient descent (SGD) favors flat minima.
This property of the SGD noise provably holds for linear networks and random feature models (RFMs).
arXiv Detail & Related papers (2022-07-06T12:40:09Z)
- Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for stochastic methods solving monotone and structured non-monotone variational inequality problems (VIPs).
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of stochastic extragradient (SEG) and stochastic gradient descent-ascent (SGDA); a minimal clipping sketch appears after this list.
arXiv Detail & Related papers (2022-06-02T15:21:55Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that stochastic gradient noise (SGN) is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by fractional Brownian motion (FBM).
arXiv Detail & Related papers (2021-05-05T13:54:26Z)
- On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes [2.6763498831034043]
Noise in stochastic gradient descent (SGD) caused by minibatch sampling remains poorly understood.
Motivated by the observation that minibatch sampling does not always cause a fluctuation, we set out to find the conditions that cause minibatch noise to emerge.
arXiv Detail & Related papers (2021-02-10T10:38:55Z)
- Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning [165.47118387176607]
It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed.
Specifically, we observe the heavy tails of gradient noise in these algorithms.
arXiv Detail & Related papers (2020-10-12T12:00:26Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods for training deep neural networks.
We show that the covariance of the SGD noise in the local region around a local minimum is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
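As referenced in the clipped stochastic methods entry above, the following is a minimal, illustrative sketch of gradient clipping inside stochastic gradient descent-ascent (SGDA) under heavy-tailed noise. The toy saddle problem, the Student-t noise, the step size, and the clipping level are assumptions for illustration only and are not taken from that paper or from NT-SGD.

```python
import numpy as np

def clip(g, c):
    """Scale g so its Euclidean norm is at most c (standard gradient clipping)."""
    n = np.linalg.norm(g)
    return g if n <= c else g * (c / n)

# Clipped SGDA on a toy strongly-convex-strongly-concave saddle problem
#   min_x max_y  0.5*||x||^2 + x^T A y - 0.5*||y||^2
# with heavy-tailed (Student-t, df=2, infinite-variance) gradient noise.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
x, y = np.ones(5), np.ones(5)
lr, clip_level = 0.05, 1.0                          # illustrative constants
for _ in range(2000):
    noise = rng.standard_t(df=2, size=(2, 5))       # heavy-tailed gradient noise
    gx = x + A @ y + noise[0]                       # stochastic gradient w.r.t. x
    gy = A.T @ x - y + noise[1]                     # stochastic gradient w.r.t. y
    x = x - lr * clip(gx, clip_level)               # clipped descent step on x
    y = y + lr * clip(gy, clip_level)               # clipped ascent step on y
print(np.linalg.norm(x), np.linalg.norm(y))         # norms typically shrink toward the saddle at 0
```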