How Can Increased Randomness in Stochastic Gradient Descent Improve
Generalization?
- URL: http://arxiv.org/abs/2108.09507v1
- Date: Sat, 21 Aug 2021 13:18:49 GMT
- Title: How Can Increased Randomness in Stochastic Gradient Descent Improve
Generalization?
- Authors: Arwen V. Bradley and Carlos Alberto Gomez-Uribe
- Abstract summary: We study the role of the SGD learning rate and batch size in generalization.
We show that increasing SGD temperature encourages the selection of local minima with lower curvature.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works report that increasing the learning rate or decreasing the
minibatch size in stochastic gradient descent (SGD) can improve test set
performance. We argue this is expected under some conditions in models with a
loss function with multiple local minima. Our main contribution is an
approximate but analytical approach inspired by methods in Physics to study the
role of the SGD learning rate and batch size in generalization. We characterize
test set performance under a shift between the training and test data
distributions for loss functions with multiple minima. The shift can simply be
due to sampling, and is therefore typically present in practical applications.
We show that the resulting shift in local minima worsens test performance by
picking up curvature, implying that generalization improves by selecting wide
and/or little-shifted local minima. We then specialize to SGD, and study its
test performance under stationarity. Because obtaining the exact stationary
distribution of SGD is intractable, we derive a Fokker-Planck approximation of
SGD and obtain its stationary distribution instead. This process shows that the
learning rate divided by the minibatch size plays a role analogous to
temperature in statistical mechanics, and implies that SGD, including its
stationary distribution, is largely invariant to changes in learning rate or
batch size that leave its temperature constant. We show that increasing SGD
temperature encourages the selection of local minima with lower curvature, and
can enable better generalization. We provide experiments on CIFAR10
demonstrating the temperature invariance of SGD, improvement of the test loss
as SGD temperature increases, and quantifying the impact of sampling versus
domain shift in driving this effect. Finally, we present synthetic experiments
showing how our theory applies in a simplified loss with two local minima.
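The temperature claim in the abstract can be illustrated with a toy model. The sketch below is not the paper's derivation or experimental setup: it assumes a one-dimensional quadratic loss with curvature `curvature` and Gaussian minibatch gradient noise whose variance is `noise_var / batch_size`. The function names and default values are hypothetical, chosen for illustration only. Under these assumptions the stationary variance of the parameter depends on the learning rate and batch size almost entirely through their ratio.

```python
import numpy as np


def temperature(lr, batch_size):
    """SGD 'temperature' as described in the abstract: learning rate / minibatch size."""
    return lr / batch_size


def stationary_variance(lr, batch_size, curvature=1.0, noise_var=1.0):
    """Exact stationary variance of theta for the toy update
        theta <- theta - lr * (curvature * theta + eps),
    with eps ~ N(0, noise_var / batch_size) (an assumed noise model).
    For small lr * curvature this is about temperature * noise_var / (2 * curvature).
    """
    return lr * noise_var / (batch_size * curvature * (2.0 - lr * curvature))


def simulate(lr, batch_size, curvature=1.0, noise_var=1.0, steps=100_000, seed=0):
    """Run the toy SGD recursion and return the empirical variance of theta
    over the second half of the trajectory (the first half is discarded as burn-in)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, np.sqrt(noise_var / batch_size), size=steps)
    theta = 0.0
    samples = np.empty(steps)
    for t in range(steps):
        theta -= lr * (curvature * theta + eps[t])
        samples[t] = theta
    return samples[steps // 2:].var()


if __name__ == "__main__":
    # Same temperature (0.001), different learning rate and batch size.
    print(stationary_variance(0.01, 10), stationary_variance(0.02, 20))
    # Four times the temperature: the spread around the minimum grows roughly fourfold.
    print(stationary_variance(0.04, 10))
    # Empirical check of the closed form.
    print(simulate(0.05, 10), stationary_variance(0.05, 10))
```

In this toy setting, doubling both the learning rate and the batch size leaves the stationary spread nearly unchanged, while raising the ratio widens it. A wider spread is what lets the iterate leave narrow basins in a loss with several minima, although the quadratic model here has only one minimum and cannot show that selection effect directly.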
Related papers
- Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution [6.144680854063938]
We consider a variant of stochastic gradient descent (SGD) with a random learning rate to reveal its convergence properties.
We demonstrate that a distribution of a parameter updated by Poisson SGD converges to a stationary distribution under weak assumptions.
arXiv Detail & Related papers (2024-06-23T06:52:33Z)
- Why is parameter averaging beneficial in SGD? An objective smoothing perspective [13.863368438870562]
Stochastic gradient descent (SGD) and its implicit bias are often characterized in terms of the sharpness of the minima.
We study the commonly-used averaged SGD algorithm, which has been empirically observed in Izmailov et al.
We prove that averaged SGD can efficiently optimize the smoothed objective which avoids sharp local minima.
arXiv Detail & Related papers (2023-02-18T16:29:06Z)
- Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits [21.353189917487512]
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for machine learning problems.
We take a step forward by proving minibatch SGD converges to a critical point of the full log-likelihood loss function.
Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay.
arXiv Detail & Related papers (2021-11-19T22:28:47Z)
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Detecting Rewards Deterioration in Episodic Reinforcement Learning [63.49923393311052]
In many RL applications, once training ends, it is vital to detect any deterioration in the agent performance as soon as possible.
We consider an episodic framework, where the rewards within each episode are not independent, nor identically-distributed, nor Markov.
We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power.
arXiv Detail & Related papers (2020-10-22T12:45:55Z)
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are mainstream methods to train deep neural networks.
We show that the covariance of the noise of SGD in the local region of the local minima is a quadratic function of the state.
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)