Why Does Multi-Epoch Training Help?
- URL: http://arxiv.org/abs/2105.06015v1
- Date: Thu, 13 May 2021 00:52:25 GMT
- Title: Why Does Multi-Epoch Training Help?
- Authors: Yi Xu, Qi Qian, Hao Li, Rong Jin
- Abstract summary: Empirically, it has been observed that SGD taking more than one pass over the training data (multi-pass SGD) achieves much better excess risk than SGD taking only one pass over the training data (one-pass SGD).
In this paper, we provide theoretical evidence explaining why multiple passes over the training data can help improve performance under certain circumstances.
- Score: 62.946840431501855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic gradient descent (SGD) has become the most attractive optimization method for training large-scale deep neural networks due to its simplicity, low computational cost per update step, and good performance. Standard excess risk bounds show that SGD only needs to take one pass over the training data and that additional passes should not improve performance. Empirically, however, SGD taking more than one pass over the training data (multi-pass SGD) has been observed to achieve much better excess risk than SGD taking only one pass (one-pass SGD), and it is not clear how to explain this phenomenon in theory. In this paper, we provide theoretical evidence for why multiple passes over the training data can help improve performance under certain circumstances. Specifically, we consider smooth risk minimization problems whose objective function is a non-convex least squares loss. Under the Polyak-Łojasiewicz (PL) condition, we establish a faster convergence rate of the excess risk bound for multi-pass SGD than for one-pass SGD.
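To make the two regimes concrete: one-pass SGD consumes each training example exactly once (a single streaming pass), whereas multi-pass SGD revisits the same finite training set for several epochs. The sketch below is only an illustration of that operational difference on a plain synthetic least squares problem; it is not the paper's construction (the theorem concerns non-convex least squares under the PL condition), and the data sizes, learning rate, and function names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least squares data: y = x^T w_star + noise.
d, n_train, n_test = 20, 500, 5000
w_star = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_star + 0.1 * rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_star + 0.1 * rng.normal(size=n_test)

def test_risk(w):
    """Held-out average squared error, used as a proxy for the population risk."""
    return np.mean((X_test @ w - y_test) ** 2)

def sgd(num_epochs, lr=0.01):
    """Plain SGD over the training set; num_epochs=1 corresponds to one-pass SGD."""
    w = np.zeros(d)
    for _ in range(num_epochs):
        for i in rng.permutation(n_train):
            grad = 2.0 * (X_train[i] @ w - y_train[i]) * X_train[i]
            w -= lr * grad
    return w

for epochs in (1, 5, 20):
    print(f"epochs={epochs:2d}  held-out risk={test_risk(sgd(epochs)):.4f}")
```

Running comparisons of this kind (tracking excess risk rather than raw test risk) is the empirical observation the paper sets out to explain; the theoretical contribution is the faster excess-risk rate for multi-pass SGD under the PL condition.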
Related papers
- Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation [3.6185342807265415]
It remains an open problem of research to explain the success and the limitations of SGD methods in rigorous theoretical terms.
In this work we prove for a large class of SGD methods that the considered method does, with high probability, not converge to global minimizers of the optimization problem.
The general non-convergence results of this work do not only apply to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods.
arXiv Detail & Related papers (2024-10-14T14:11:37Z) - Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although multi-pass SGD needs more iterations than GD to achieve the same level of excess risk, it saves on the number of stochastic gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - Robust Training of Neural Networks using Scale Invariant Architectures [70.67803417918854]
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks.
We show that this general approach is robust to rescaling of the parameters and the loss.
We design a scale-invariant version of BERT, called SIBERT, which, when trained simply by vanilla SGD, achieves performance comparable to BERT trained by adaptive methods like Adam.
arXiv Detail & Related papers (2022-02-02T11:58:56Z) - SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [30.41773138781369]
We present a multi-epoch variant of Stochastic Gradient Descent (SGD) commonly used in practice.
We prove that this is at least as good as single-pass SGD in the worst case.
For certain stochastic convex optimization (SCO) problems, taking multiple passes over the dataset can significantly outperform single-pass SGD.
arXiv Detail & Related papers (2021-07-11T15:50:01Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Variance Reduced Local SGD with Lower Communication Complexity [52.44473777232414]
We propose Variance Reduced Local SGD to further reduce the communication complexity.
VRL-SGD achieves a linear iteration speedup with a lower communication complexity $O(T^{1/2} N^{3/2})$ even if workers access non-identical datasets.
arXiv Detail & Related papers (2019-12-30T08:15:21Z)
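On the communication-reduction idea behind the last entry: in local SGD, each worker takes several local gradient steps between synchronizations, so communication happens once per round rather than once per step. The sketch below is a plain local-SGD loop on synthetic, non-identical per-worker least squares data; it deliberately omits the variance-reduction correction that defines VRL-SGD, and the worker counts, step sizes, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Non-identical local data: each worker has its own least squares problem
# centered around a shared ground-truth vector.
num_workers, d, n_local = 4, 10, 200
w_true = rng.normal(size=d)
local_X = [rng.normal(size=(n_local, d)) for _ in range(num_workers)]
local_y = [X @ (w_true + 0.3 * rng.normal(size=d)) for X in local_X]

def local_sgd(rounds=50, local_steps=10, lr=0.01):
    """Plain local SGD: local steps on each worker, then a single averaging step.

    Communication happens once per round (the averaging), not once per gradient
    step, which is where the communication savings come from.
    """
    w_global = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for X, y in zip(local_X, local_y):
            w = w_global.copy()
            for _ in range(local_steps):
                i = rng.integers(n_local)
                w -= lr * 2.0 * (X[i] @ w - y[i]) * X[i]
            local_models.append(w)
        w_global = np.mean(local_models, axis=0)  # one communication round
    return w_global

w = local_sgd()
avg_risk = np.mean([np.mean((X @ w - y) ** 2) for X, y in zip(local_X, local_y)])
print(f"average local training risk: {avg_risk:.4f}")
```

The variance-reduced variant in the cited paper modifies these local updates; the plain version above only illustrates the local-steps-then-average structure underlying the communication savings.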