Why is parameter averaging beneficial in SGD? An objective smoothing perspective
- URL: http://arxiv.org/abs/2302.09376v2
- Date: Sun, 26 May 2024 11:54:08 GMT
- Title: Why is parameter averaging beneficial in SGD? An objective smoothing perspective
- Authors: Atsushi Nitanda, Ryuhei Kikuchi, Shugo Maeda, Denny Wu,
- Abstract summary: The implicit bias of stochastic gradient descent (SGD) is often characterized in terms of the sharpness of the minima.
We study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum.
We prove that, in certain problem settings, averaged SGD can efficiently optimize the smoothed objective, which avoids sharp local minima.
- Score: 13.863368438870562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD which eliminates sharp local minima by the convolution using the stochastic gradient noise. We follow this line of research and study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore achieves better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.
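To make the mechanics of parameter averaging concrete, here is a minimal sketch of averaged SGD. It is not the paper's experimental setup: the toy objective f(w) = 0.5 * w**2, the additive Gaussian gradient noise, and the names `grad` and `averaged_sgd` are all assumptions chosen for illustration. It only shows how the running average of the iterates is maintained alongside the ordinary SGD update.

```python
import numpy as np


def grad(w):
    """Gradient of the toy objective f(w) = 0.5 * w**2 (an assumption, not from the paper)."""
    return w


def averaged_sgd(w0, step_size, num_steps, noise_std, seed=0):
    """Run SGD with additive Gaussian gradient noise.

    Returns (last iterate, running average of the iterates w_1..w_T).
    """
    rng = np.random.default_rng(seed)
    w = float(w0)
    w_avg = 0.0
    for t in range(1, num_steps + 1):
        # Stochastic gradient: true gradient plus synthetic noise.
        noisy_grad = grad(w) + noise_std * rng.standard_normal()
        w = w - step_size * noisy_grad
        # Incremental mean: w_avg_t = w_avg_{t-1} + (w_t - w_avg_{t-1}) / t
        w_avg += (w - w_avg) / t
    return w, w_avg


last, avg = averaged_sgd(w0=1.0, step_size=0.1, num_steps=2000, noise_std=1.0)
print(f"last iterate: {last:+.4f}, averaged iterate: {avg:+.4f}")
```

On this toy quadratic with a constant step size, the last iterate keeps fluctuating around the minimum at a scale set by the step size and the noise level, while the averaged iterate typically fluctuates much less. The sharp-versus-flat minima behavior discussed in the abstract needs a non-convex objective and is not reproduced by this sketch.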
Related papers
- Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials [15.718093624695552]
We analyze the convergence of Stochastic Gradient Langevin Dynamics (SGLD) to global minima based on Lyapunov potentials and optimization.
We provide 1) improved rates in the setting of previous works studying SGLD for optimization, 2) the first finite gradient complexity guarantee for SGLD, and 3) a proof that if continuous-time Langevin Dynamics succeeds for optimization, then discrete-time SGLD succeeds under mild regularity assumptions.
arXiv Detail & Related papers (2024-07-05T05:34:10Z)
- Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models via Reparameterisation and Smoothing [1.6114012813668932]
We introduce a simple framework to define non-differentiable functions piecewisely and present a systematic approach to obtain smoothings.
Our main contribution is a novel variant of SGD, Diagonalisation Stochastic Gradient Descent, which progressively enhances the accuracy of the smoothed approximation.
Our approach is simple, fast, and stable, and attains orders of magnitude reduction in work-normalised variance.
arXiv Detail & Related papers (2024-02-19T00:43:22Z)
- Bias-Aware Minimisation: Understanding and Mitigating Estimator Bias in Private SGD [56.01810892677744]
We show a connection between per-sample gradient norms and the estimation bias of the private gradient oracle used in DP-SGD.
We propose Bias-Aware Minimisation (BAM) that allows for the provable reduction of private gradient estimator bias.
arXiv Detail & Related papers (2023-08-23T09:20:41Z)
- Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization [33.50116027503244]
We show that zeroth-order flatness can be insufficient to discriminate minima with low generalization error from those with high generalization error.
We also present a novel training procedure named Gradient norm Aware Minimization (GAM) to seek minima with uniformly small curvature across all directions.
arXiv Detail & Related papers (2023-03-03T16:58:53Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- How Can Increased Randomness in Stochastic Gradient Descent Improve Generalization? [0.0]
We study the role of the SGD learning rate and batch size in generalization.
We show that increasing SGD temperature encourages the selection of local minima with lower curvature.
arXiv Detail & Related papers (2021-08-21T13:18:49Z)
- The Benefits of Implicit Regularization from SGD in Least Squares Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We make comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning [165.47118387176607]
It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed.
Specifically, we observe the heavy tails of gradient noise in these algorithms.
arXiv Detail & Related papers (2020-10-12T12:00:26Z)