SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs
- URL: http://arxiv.org/abs/2107.05074v1
- Date: Sun, 11 Jul 2021 15:50:01 GMT
- Title: SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs
- Authors: Satyen Kale, Ayush Sekhari, Karthik Sridharan
- Abstract summary: We present a multi-epoch variant of Stochastic Gradient Descent (SGD) commonly used in practice.
We prove that this is at least as good as single pass SGD in the worst case.
For certain SCO problems, taking multiple passes over the dataset can significantly outperform single pass SGD.
- Score: 30.41773138781369
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the
method of choice for learning with large over-parameterized models. A popular
theory for explaining why SGD works well in practice is that the algorithm has
an implicit regularization that biases its output towards a good solution.
Perhaps the theoretically most well understood learning setting for SGD is that
of Stochastic Convex Optimization (SCO), where it is well known that SGD learns
at a rate of $O(1/\sqrt{n})$, where $n$ is the number of samples. In this
paper, we consider the problem of SCO and explore the role of implicit
regularization, batch size and multiple epochs for SGD. Our main contributions
are threefold:
(a) We show that for any regularizer, there is an SCO problem for which
Regularized Empirical Risk Minimization (RERM) fails to learn. This automatically rules
out any implicit-regularization-based explanation for the success of SGD.
(b) We provide a separation between SGD and learning via Gradient Descent on
empirical loss (GD) in terms of sample complexity. We show that there is an SCO
problem such that GD with any step size and number of iterations can only learn
at a suboptimal rate: at least $\widetilde{\Omega}(1/n^{5/12})$.
(c) We present a multi-epoch variant of SGD commonly used in practice. We
prove that this algorithm is at least as good as single pass SGD in the worst
case. However, for certain SCO problems, taking multiple passes over the
dataset can significantly outperform single pass SGD.
We extend our results to the general learning setting by showing a problem
which is learnable for any data distribution, and for this problem, SGD is
strictly better than RERM for any regularization function. We conclude by
discussing the implications of our results for deep learning, and show a
separation between SGD and ERM for two-layer diagonal neural networks.
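To make contribution (c) concrete, here is a minimal sketch, under our own illustrative assumptions rather than the paper's exact algorithm, contrasting single pass SGD (one gradient step per fresh sample) with a multi-epoch, small-batch variant that reshuffles and revisits the same $n$ samples. The squared loss, the $1/\sqrt{t}$ step sizes, and the iterate averaging are our choices for the sketch:

```python
# Minimal sketch (ours, not the paper's exact algorithm): single pass SGD vs.
# a multi-epoch, small-batch variant on a toy stochastic convex problem.
# Illustrative assumptions: squared loss, 1/sqrt(t) step sizes, iterate averaging.
import numpy as np

rng = np.random.default_rng(0)

def grad(w, x, y):
    """Gradient of the squared loss f(w; x, y) = 0.5 * (w @ x - y)**2."""
    return (w @ x - y) * x

def single_pass_sgd(X, Y, lr=0.1):
    """One pass over the data, one fresh sample per step; returns the averaged iterate."""
    w = np.zeros(X.shape[1])
    avg = np.zeros_like(w)
    for t, (x, y) in enumerate(zip(X, Y), start=1):
        w = w - (lr / np.sqrt(t)) * grad(w, x, y)
        avg += (w - avg) / t  # running average of the iterates
    return avg

def multi_epoch_sgd(X, Y, epochs=5, batch_size=8, lr=0.1):
    """Several reshuffled passes with small minibatches; returns the averaged iterate."""
    n, d = X.shape
    w = np.zeros(d)
    avg = np.zeros_like(w)
    t = 0
    for _ in range(epochs):
        perm = rng.permutation(n)  # fresh shuffle each epoch (sampling without replacement)
        for i in range(0, n, batch_size):
            batch = perm[i:i + batch_size]
            g = np.mean([grad(w, X[j], Y[j]) for j in batch], axis=0)
            t += 1
            w = w - (lr / np.sqrt(t)) * g
            avg += (w - avg) / t
    return avg

# Toy SCO instance: noisy linear model.
n, d = 256, 10
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ w_star + 0.1 * rng.normal(size=n)

for name, w_hat in [("single pass SGD", single_pass_sgd(X, Y)),
                    ("multi-epoch SGD", multi_epoch_sgd(X, Y))]:
    print(f"{name}: empirical risk {np.mean((X @ w_hat - Y) ** 2):.4f}")
```

The paper's claim is about the general setting, not this toy instance: such a multi-epoch variant is never worse than single pass SGD in the worst case, and on certain SCO instances the extra passes strictly help.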
Related papers
- On the Trajectories of SGD Without Replacement [0.0]
This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD).
We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks.
arXiv Detail & Related papers (2023-12-26T18:06:48Z)
- Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z)
- Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves on the number of stochastic gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z)
- Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z)
- The Benefits of Implicit Regularization from SGD in Least Squares Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z)
- Label Noise SGD Provably Prefers Flat Global Minimizers [48.883469271546076]
In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to.
We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) + \lambda R(\theta)$, where $L(\theta)$ is the training loss.
Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.
arXiv Detail & Related papers (2021-06-11T17:59:07Z)
- SGD Generalizes Better Than GD (And Regularization Doesn't Help) [39.588906680621825]
We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD).
We show that with the same number of steps, GD may overfit and emit a solution with $\Omega(1)$ generalization error.
We discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
arXiv Detail & Related papers (2021-02-01T19:18:40Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information-theoretic minimax lower bounds for this problem in terms of the mixing time $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay, a popular reinforcement learning technique, that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z)
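For intuition about the experience-replay approach in the last entry, here is a minimal sketch under our own assumptions (an AR(1) chain for the covariates, a sliding-window buffer, and a fixed step size); the paper's actual algorithm, and its guarantees in terms of $\tau_{\mathsf{mix}}$, are more refined:

```python
# Minimal sketch of the experience-replay idea for least squares with Markovian
# data. The AR(1) covariate chain, sliding-window buffer, and fixed step size are
# our illustrative assumptions; the paper's algorithm and guarantees are more refined.
import numpy as np

rng = np.random.default_rng(1)

def markov_stream(d, w_star, rho=0.9):
    """Yields (x, y) pairs whose covariates follow an AR(1) chain, so samples are dependent."""
    x = rng.normal(size=d)
    while True:
        x = rho * x + np.sqrt(1.0 - rho**2) * rng.normal(size=d)
        yield x, x @ w_star + 0.1 * rng.normal()

def sgd_with_replay(stream, d, steps=2000, buffer_size=500, lr=0.01):
    """Each step stores the incoming sample, then updates on a random buffered sample."""
    buffer, w = [], np.zeros(d)
    for _ in range(steps):
        buffer.append(next(stream))               # ingest the next (correlated) sample
        if len(buffer) > buffer_size:
            buffer.pop(0)                         # keep a sliding window of recent samples
        x, y = buffer[rng.integers(len(buffer))]  # replay: a random draw breaks correlation
        w = w - lr * (w @ x - y) * x              # least-squares SGD step
    return w

d = 5
w_star = rng.normal(size=d)
w_hat = sgd_with_replay(markov_stream(d, w_star), d)
print("parameter error:", np.linalg.norm(w_hat - w_star))
```

Sampling from the buffer makes successive stochastic gradients closer to independent draws, which is, at a high level, how replay can improve over running SGD directly on the correlated stream.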