Disentangling the Mechanisms Behind Implicit Regularization in SGD
- URL: http://arxiv.org/abs/2211.15853v1
- Date: Tue, 29 Nov 2022 01:05:04 GMT
- Title: Disentangling the Mechanisms Behind Implicit Regularization in SGD
- Authors: Zachary Novack, Simran Kaur, Tanya Marwah, Saurabh Garg, Zachary C.
Lipton
- Abstract summary: This paper focuses on the ability of various theorized mechanisms to close the small-to-large batch generalization gap.
We show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the large-batch regime recovers small-batch SGD generalization.
This generalization performance is shown to often be correlated with how well the regularized model's gradient norms resemble those of small-batch SGD.
- Score: 21.893397581060636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A number of competing hypotheses have been proposed to explain why
small-batch Stochastic Gradient Descent (SGD) leads to improved generalization
over the full-batch regime, with recent work crediting the implicit
regularization of various quantities throughout training. However, to date,
empirical evidence assessing the explanatory power of these hypotheses is
lacking. In this paper, we conduct an extensive empirical evaluation, focusing
on the ability of various theorized mechanisms to close the small-to-large
batch generalization gap. Additionally, we characterize how the quantities that
SGD has been claimed to (implicitly) regularize change over the course of
training. By using micro-batches, i.e. disjoint smaller subsets of each
mini-batch, we empirically show that explicitly penalizing the gradient norm or
the Fisher Information Matrix trace, averaged over micro-batches, in the
large-batch regime recovers small-batch SGD generalization, whereas
Jacobian-based regularizations fail to do so. This generalization performance
is shown to often be correlated with how well the regularized model's gradient
norms resemble those of small-batch SGD. We additionally show that this
behavior breaks down as the micro-batch size approaches the batch size.
Finally, we note that in this line of inquiry, positive experimental findings
on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the
need to test hypotheses on a wider collection of datasets.
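The intervention studied here can be stated concretely: each large batch is split into disjoint micro-batches, and the training objective becomes the large-batch loss plus a coefficient times the average, over micro-batches, of the L2 norm of the per-micro-batch gradient. The sketch below is a minimal PyTorch rendering of that penalty under assumed placeholder names (model, loss_fn, lam, micro_bs); it is not the authors' released code.

```python
# Minimal sketch (assumed PyTorch rendering, not the authors' code) of large-batch
# training with an explicit gradient-norm penalty averaged over disjoint micro-batches.
import torch


def microbatch_grad_norm(model, loss_fn, x, y, micro_bs):
    """Average L2 norm of per-micro-batch gradients (differentiable via create_graph)."""
    params = [p for p in model.parameters() if p.requires_grad]
    norms = []
    for xm, ym in zip(x.split(micro_bs), y.split(micro_bs)):
        loss = loss_fn(model(xm), ym)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        norms.append(torch.sqrt(sum((g ** 2).sum() for g in grads)))
    return torch.stack(norms).mean()


def training_step(model, optimizer, loss_fn, x, y, lam=0.01, micro_bs=128):
    """One step on the penalized objective: large-batch loss + lam * averaged penalty."""
    optimizer.zero_grad()
    base = loss_fn(model(x), y)                      # plain large-batch loss
    penalty = microbatch_grad_norm(model, loss_fn, x, y, micro_bs)
    (base + lam * penalty).backward()                # gradients flow through the penalty term
    optimizer.step()
    return base.detach(), penalty.detach()
```

The Fisher Information Matrix trace penalty mentioned in the abstract would, under its standard definition, use squared per-micro-batch gradient norms with labels sampled from the model's own predictive distribution rather than the true labels; the surrounding training loop is otherwise unchanged.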
Related papers
- Rethinking Regularization Methods for Knowledge Graph Completion [25.269091177345565]
We introduce a novel sparse-regularization method that embeds the concept of rank-based selective sparsity into the KGC regularizer.
Experiments on multiple datasets and models show that the SPR regularization method outperforms other regularizers and can push the KGC model beyond its previous performance margin.
arXiv Detail & Related papers (2025-05-29T13:39:18Z) - Fine-Grained Bias Exploration and Mitigation for Group-Robust Classification [11.525201208566925]
Bias Exploration via Overfitting (BEO) captures each distribution in greater detail by modeling it as a mixture of latent groups.
We introduce a fine-grained variant of CCDB, termed FG-CCDB, which performs more precise distribution matching and balancing within each group.
Our method performs on par with bias-supervised approaches on binary classification tasks and significantly outperforms them in highly biased multi-class scenarios.
arXiv Detail & Related papers (2025-05-11T04:01:34Z) - Deep Anti-Regularized Ensembles provide reliable out-of-distribution
uncertainty quantification [4.750521042508541]
Deep ensembles often return overconfident estimates outside the training domain.
We show that an ensemble of networks with large weights that still fit the training data is likely to meet these two objectives.
We derive a theoretical framework for this approach and show that the proposed optimization can be seen as a "water-filling" problem.
arXiv Detail & Related papers (2023-04-08T15:25:12Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent:
Convergence Guarantees and Empirical Benefits [21.353189917487512]
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for machine learning problems.
We take a step forward by proving minibatch SGD converges to a critical point of the full log-likelihood loss function.
Our theoretical guarantees hold provided that the kernel functions exhibit exponential or eigendecay.
arXiv Detail & Related papers (2021-11-19T22:28:47Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We make comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - Implicit Gradient Alignment in Distributed and Federated Learning [39.61762498388211]
A major obstacle to achieving global convergence in distributed and federated learning is misalignment of gradients across clients.
We propose a novel GradAlign algorithm that induces the same implicit regularization while allowing the use of arbitrarily large batches in each update.
arXiv Detail & Related papers (2021-06-25T22:01:35Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - Generalized Sliced Distances for Probability Distributions [47.543990188697734]
We introduce a broad family of probability metrics, coined Generalized Sliced Probability Metrics (GSPMs).
GSPMs are rooted in the generalized Radon transform and come with a unique geometric interpretation.
We consider GSPM-based gradient flows for generative modeling applications and show that under mild assumptions, the gradient flow converges to the global optimum.
arXiv Detail & Related papers (2020-02-28T04:18:00Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.