On the Origin of Implicit Regularization in Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2101.12176v1
- Date: Thu, 28 Jan 2021 18:32:14 GMT
- Title: On the Origin of Implicit Regularization in Stochastic Gradient Descent
- Authors: Samuel L. Smith, Benoit Dherin, David G. T. Barrett and Soham De
- Abstract summary: For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function.
We prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite.
We verify that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.
- Score: 22.802683068658897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For infinitesimal learning rates, stochastic gradient descent (SGD) follows
the path of gradient flow on the full batch loss function. However, moderately
large learning rates can achieve higher test accuracies, and this
generalization benefit is not explained by convergence bounds, since the
learning rate which maximizes test accuracy is often larger than the learning
rate which minimizes training loss. To interpret this phenomenon we prove that
for SGD with random shuffling, the mean SGD iterate also stays close to the
path of gradient flow if the learning rate is small and finite, but on a
modified loss. This modified loss is composed of the original loss function and
an implicit regularizer, which penalizes the norms of the minibatch gradients.
Under mild assumptions, when the batch size is small the scale of the implicit
regularization term is proportional to the ratio of the learning rate to the
batch size. We verify empirically that explicitly including the implicit
regularizer in the loss can enhance the test accuracy when the learning rate is
small.
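Concretely, the abstract's claim can be stated as a worked equation. Under random shuffling, the mean SGD iterate follows gradient flow on a modified loss of roughly the following form (our notation, not quoted from the paper: $\epsilon$ is the learning rate, $m$ the number of minibatches per epoch, and $\hat{C}_k$ the loss on minibatch $k$):

$$\tilde{C}(\omega) \;=\; C(\omega) \;+\; \frac{\epsilon}{4m}\sum_{k=1}^{m}\big\lVert \nabla \hat{C}_k(\omega)\big\rVert^{2}.$$

Because each $\hat{C}_k$ averages over $B$ examples, the fluctuations of the minibatch gradients, and hence the size of the penalty term, grow as the batch size $B$ shrinks, consistent with the abstract's $\epsilon/B$ scaling.

The last sentence of the abstract suggests adding this regularizer explicitly. A minimal sketch of that idea, assuming JAX and a differentiable per-minibatch loss `loss_fn(params, batch)`; all names here are illustrative, not from the authors' code:

```python
import jax
import jax.numpy as jnp

def regularized_loss(params, batches, loss_fn, lr):
    """Full-batch loss plus an explicit version of SGD's implicit
    regularizer: (lr / 4m) * sum_k ||grad C_k(params)||^2."""
    m = len(batches)
    # The full-batch loss is the mean of the per-minibatch losses.
    full_loss = jnp.mean(jnp.stack([loss_fn(params, b) for b in batches]))
    # Accumulate the squared norm of each minibatch gradient.
    grad_fn = jax.grad(loss_fn)
    penalty = 0.0
    for b in batches:
        g = grad_fn(params, b)
        penalty += sum(jnp.sum(leaf ** 2)
                       for leaf in jax.tree_util.tree_leaves(g))
    return full_loss + (lr / (4.0 * m)) * penalty
```

Note that this costs one extra gradient evaluation per minibatch, and differentiating the penalty adds a Hessian-vector product, so in practice it would typically be estimated on a subset of minibatches.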
Related papers
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Implicit bias of deep linear networks in the large learning rate phase [15.846533303963229]
We characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in a large learning rate regime.
We claim that, depending on the separation conditions of the data, the gradient descent iterates will converge to a flatter minimum in the catapult phase.
arXiv Detail & Related papers (2020-11-25T06:50:30Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD converges along the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Implicit Gradient Regularization [18.391141066502644]
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization.
We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization.
arXiv Detail & Related papers (2020-09-23T14:17:53Z)
- Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z)
- A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of the average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training; a sketch of such a measurement appears after this list.
arXiv Detail & Related papers (2020-07-09T03:23:10Z)
- Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
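A minimal sketch of the kind of gradient-variance measurement described in "A Study of Gradient Variance in Deep Learning" above, again assuming JAX and an illustrative `loss_fn(params, batch)` (not the authors' code):

```python
import jax
import jax.numpy as jnp

def flat_grad(loss_fn, params, batch):
    # Flatten the gradient pytree into a single vector.
    g = jax.grad(loss_fn)(params, batch)
    return jnp.concatenate([jnp.ravel(leaf)
                            for leaf in jax.tree_util.tree_leaves(g)])

def minibatch_gradient_variance(loss_fn, params, batches):
    """Total variance (trace of the covariance) of per-minibatch
    gradients: mean over k of ||g_k - g_mean||^2."""
    grads = jnp.stack([flat_grad(loss_fn, params, b) for b in batches])
    mean_grad = jnp.mean(grads, axis=0)
    return jnp.mean(jnp.sum((grads - mean_grad) ** 2, axis=1))
```

Tracking this quantity across training epochs is one way to observe the increase in gradient variance that the paper reports.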
This list is automatically generated from the titles and abstracts of the papers on this site.