On the Origin of Implicit Regularization in Stochastic Gradient Descent
- URL: http://arxiv.org/abs/2101.12176v1
- Date: Thu, 28 Jan 2021 18:32:14 GMT
- Title: On the Origin of Implicit Regularization in Stochastic Gradient Descent
- Authors: Samuel L. Smith, Benoit Dherin, David G. T. Barrett and Soham De
- Abstract summary: For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function.
We prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite.
We verify that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.
- Score: 22.802683068658897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For infinitesimal learning rates, stochastic gradient descent (SGD) follows
the path of gradient flow on the full batch loss function. However, moderately
large learning rates can achieve higher test accuracies, and this
generalization benefit is not explained by convergence bounds, since the
learning rate which maximizes test accuracy is often larger than the learning
rate which minimizes training loss. To interpret this phenomenon we prove that
for SGD with random shuffling, the mean SGD iterate also stays close to the
path of gradient flow if the learning rate is small and finite, but on a
modified loss. This modified loss is composed of the original loss function and
an implicit regularizer, which penalizes the norms of the minibatch gradients.
Under mild assumptions, when the batch size is small the scale of the implicit
regularization term is proportional to the ratio of the learning rate to the
batch size. We verify empirically that explicitly including the implicit
regularizer in the loss can enhance the test accuracy when the learning rate is
small.
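Concretely, the abstract's claim can be stated as a worked equation. Under random shuffling, the mean SGD iterate follows gradient flow on a modified loss of roughly the following form (our notation, not quoted from the paper: $\epsilon$ is the learning rate, $m$ the number of minibatches per epoch, and $\hat{C}_k$ the loss on minibatch $k$):

$$\tilde{C}(\omega) \;=\; C(\omega) \;+\; \frac{\epsilon}{4m}\sum_{k=1}^{m}\big\lVert \nabla \hat{C}_k(\omega)\big\rVert^{2}.$$

Because each $\hat{C}_k$ averages over $B$ examples, the fluctuations of the minibatch gradients, and hence the size of the penalty term, grow as the batch size $B$ shrinks, consistent with the abstract's $\epsilon/B$ scaling.

The last sentence of the abstract suggests adding this regularizer explicitly. A minimal sketch of that idea, assuming JAX and a differentiable per-minibatch loss `loss_fn(params, batch)`; all names here are illustrative, not from the authors' code:

```python
import jax
import jax.numpy as jnp

def regularized_loss(params, batches, loss_fn, lr):
    """Full-batch loss plus an explicit version of SGD's implicit
    regularizer: (lr / 4m) * sum_k ||grad C_k(params)||^2."""
    m = len(batches)
    # The full-batch loss is the mean of the per-minibatch losses.
    full_loss = jnp.mean(jnp.stack([loss_fn(params, b) for b in batches]))
    # Accumulate the squared norm of each minibatch gradient.
    grad_fn = jax.grad(loss_fn)
    penalty = 0.0
    for b in batches:
        g = grad_fn(params, b)
        penalty += sum(jnp.sum(leaf ** 2)
                       for leaf in jax.tree_util.tree_leaves(g))
    return full_loss + (lr / (4.0 * m)) * penalty
```

Note that this costs one extra gradient evaluation per minibatch, and differentiating the penalty adds a Hessian-vector product, so in practice it would typically be estimated on a subset of minibatches.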
Related papers
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Implicit bias of deep linear networks in the large learning rate phase [15.846533303963229]
We characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in a large learning rate regime.
We claim that, depending on the separation conditions of the data, the gradient descent iterates will converge to a flatter minimum in the catapult phase.
arXiv Detail & Related papers (2020-11-25T06:50:30Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD converges along the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Implicit Gradient Regularization [18.391141066502644]
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization.
We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization.
arXiv Detail & Related papers (2020-09-23T14:17:53Z)
- Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy [71.25689267025244]
We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss.
Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies.
arXiv Detail & Related papers (2020-07-13T23:49:53Z)
- A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of the average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training; a sketch of such a measurement appears after this list.
arXiv Detail & Related papers (2020-07-09T03:23:10Z)
- Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
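A minimal sketch of the kind of gradient-variance measurement described in "A Study of Gradient Variance in Deep Learning" above, again assuming JAX and an illustrative `loss_fn(params, batch)` (not the authors' code):

```python
import jax
import jax.numpy as jnp

def flat_grad(loss_fn, params, batch):
    # Flatten the gradient pytree into a single vector.
    g = jax.grad(loss_fn)(params, batch)
    return jnp.concatenate([jnp.ravel(leaf)
                            for leaf in jax.tree_util.tree_leaves(g)])

def minibatch_gradient_variance(loss_fn, params, batches):
    """Total variance (trace of the covariance) of per-minibatch
    gradients: mean over k of ||g_k - g_mean||^2."""
    grads = jnp.stack([flat_grad(loss_fn, params, b) for b in batches])
    mean_grad = jnp.mean(grads, axis=0)
    return jnp.mean(jnp.sum((grads - mean_grad) ** 2, axis=1))
```

Tracking this quantity across training epochs is one way to observe the increase in gradient variance that the paper reports.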
This list is automatically generated from the titles and abstracts of the papers on this site.