Stochastic Training is Not Necessary for Generalization
- URL: http://arxiv.org/abs/2109.14119v1
- Date: Wed, 29 Sep 2021 00:50:00 GMT
- Title: Stochastic Training is Not Necessary for Generalization
- Authors: Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom
Goldstein
- Abstract summary: It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
- Score: 57.04880404584737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is widely believed that the implicit regularization of stochastic gradient
descent (SGD) is fundamental to the impressive generalization behavior we
observe in neural networks. In this work, we demonstrate that non-stochastic
full-batch training can achieve strong performance on CIFAR-10 that is on-par
with SGD, using modern architectures in settings with and without data
augmentation. To this end, we utilize modified hyperparameters and show that
the implicit regularization of SGD can be completely replaced with explicit
regularization. This strongly suggests that theories that rely heavily on
properties of stochastic sampling to explain generalization are incomplete, as
strong generalization behavior is still observed in the absence of stochastic
sampling. Fundamentally, deep learning can succeed without stochasticity. Our
observations further indicate that the perceived difficulty of full-batch
training is largely the result of its optimization properties and the
disproportionate time and effort spent by the ML community tuning optimizers
and hyperparameters for small-batch training.
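The abstract states the recipe only at a high level: modified hyperparameters plus explicit regularization in place of SGD's implicit regularization. As a rough illustration of that idea, the sketch below takes a single full-batch gradient-descent step with an explicit squared-gradient-norm penalty; the penalty form, the learning rate, and the coefficient `lam` are assumptions made for the example, not the authors' exact recipe, and a real full-batch CIFAR-10 run would accumulate the gradient over chunks of the training set rather than in one forward pass.
```python
# Illustrative sketch only (not the paper's exact recipe): one full-batch
# gradient-descent step where an explicit gradient-norm penalty stands in
# for the implicit regularization usually attributed to small-batch SGD.
import torch

def full_batch_step(model, loss_fn, inputs, targets, lr=0.4, lam=5e-3):
    """Minimize loss + lam * ||d loss / d params||^2 with plain (non-stochastic) GD."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets)
    # First backward pass keeps the graph so the penalty term stays differentiable.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    penalty = sum(g.pow(2).sum() for g in grads)        # squared gradient norm
    total_grads = torch.autograd.grad(loss + lam * penalty, params)
    with torch.no_grad():
        for p, g in zip(params, total_grads):
            p -= lr * g                                  # deterministic full-batch step
    return loss.item()
```
Because `inputs` here is the entire training set (or an accumulated pass over it), the update contains no minibatch noise; any regularization effect must come from the explicit penalty and the chosen hyperparameters.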
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Understanding Why Generalized Reweighting Does Not Improve Over ERM [36.69039005731499]
Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different.
A suite of approaches, such as importance weighting and variants of distributionally robust optimization (DRO), has been proposed to solve this problem.
But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift.
arXiv Detail & Related papers (2022-01-28T17:58:38Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM), under a broad range of step-sizes (a generic sketch of the standard momentum update appears after this list).
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
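For reference, the two momentum papers above analyze variants of the heavy-ball (SGDM) update. The snippet below is a generic sketch of the standard rule in its usual velocity/parameter form, with placeholder hyperparameter values; it does not reproduce the modified early-momentum (SGDEM) schedule studied in those papers.
```python
# Generic heavy-ball SGD-with-momentum step (standard SGDM), shown for
# reference; the early-momentum variant (SGDEM) discussed above is not
# reproduced here. lr and momentum are placeholder values.
import torch

def sgdm_step(params, grads, velocities, lr=0.1, momentum=0.9):
    """v <- momentum * v + g;  w <- w - lr * v."""
    with torch.no_grad():
        for p, g, v in zip(params, grads, velocities):
            v.mul_(momentum).add_(g)   # update the velocity buffer in place
            p.sub_(lr * v)             # take the parameter step
```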
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.