Disentangling the Mechanisms Behind Implicit Regularization in SGD
- URL: http://arxiv.org/abs/2211.15853v1
- Date: Tue, 29 Nov 2022 01:05:04 GMT
- Title: Disentangling the Mechanisms Behind Implicit Regularization in SGD
- Authors: Zachary Novack, Simran Kaur, Tanya Marwah, Saurabh Garg, Zachary C.
Lipton
- Abstract summary: This paper focuses on the ability of various theorized mechanisms to close the small-to-large batch generalization gap.
We show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the large-batch regime recovers small-batch SGD generalization.
This generalization performance is shown to often be correlated with how well the regularized model's gradient norms resemble those of small-batch SGD.
- Score: 21.893397581060636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A number of competing hypotheses have been proposed to explain why
small-batch Stochastic Gradient Descent (SGD) leads to improved generalization
over the full-batch regime, with recent work crediting the implicit
regularization of various quantities throughout training. However, to date,
empirical evidence assessing the explanatory power of these hypotheses is
lacking. In this paper, we conduct an extensive empirical evaluation, focusing
on the ability of various theorized mechanisms to close the small-to-large
batch generalization gap. Additionally, we characterize how the quantities that
SGD has been claimed to (implicitly) regularize change over the course of
training. By using micro-batches, i.e. disjoint smaller subsets of each
mini-batch, we empirically show that explicitly penalizing the gradient norm or
the Fisher Information Matrix trace, averaged over micro-batches, in the
large-batch regime recovers small-batch SGD generalization, whereas
Jacobian-based regularizations fail to do so. This generalization performance
is shown to often be correlated with how well the regularized model's gradient
norms resemble those of small-batch SGD. We additionally show that this
behavior breaks down as the micro-batch size approaches the batch size.
Finally, we note that in this line of inquiry, positive experimental findings
on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the
need to test hypotheses on a wider collection of datasets.
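The intervention studied here can be stated concretely: each large batch is split into disjoint micro-batches, and the training objective becomes the large-batch loss plus a coefficient times the average, over micro-batches, of the L2 norm of the per-micro-batch gradient. The sketch below is a minimal PyTorch rendering of that penalty under assumed placeholder names (model, loss_fn, lam, micro_bs); it is not the authors' released code.

```python
# Minimal sketch (assumed PyTorch rendering, not the authors' code) of large-batch
# training with an explicit gradient-norm penalty averaged over disjoint micro-batches.
import torch


def microbatch_grad_norm(model, loss_fn, x, y, micro_bs):
    """Average L2 norm of per-micro-batch gradients (differentiable via create_graph)."""
    params = [p for p in model.parameters() if p.requires_grad]
    norms = []
    for xm, ym in zip(x.split(micro_bs), y.split(micro_bs)):
        loss = loss_fn(model(xm), ym)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        norms.append(torch.sqrt(sum((g ** 2).sum() for g in grads)))
    return torch.stack(norms).mean()


def training_step(model, optimizer, loss_fn, x, y, lam=0.01, micro_bs=128):
    """One step on the penalized objective: large-batch loss + lam * averaged penalty."""
    optimizer.zero_grad()
    base = loss_fn(model(x), y)                      # plain large-batch loss
    penalty = microbatch_grad_norm(model, loss_fn, x, y, micro_bs)
    (base + lam * penalty).backward()                # gradients flow through the penalty term
    optimizer.step()
    return base.detach(), penalty.detach()
```

The Fisher Information Matrix trace penalty mentioned in the abstract would, under its standard definition, use squared per-micro-batch gradient norms with labels sampled from the model's own predictive distribution rather than the true labels; the surrounding training loop is otherwise unchanged.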
Related papers
- Rethinking Regularization Methods for Knowledge Graph Completion [25.269091177345565]
We introduce a novel sparse-regularization method that embeds the concept of rank-based selective sparsity into the KGC regularizer.
Experiments on multiple datasets and models show that the SPR regularization method outperforms other regularizers and can push the KGC model beyond its previous performance margin.
arXiv Detail & Related papers (2025-05-29T13:39:18Z) - Fine-Grained Bias Exploration and Mitigation for Group-Robust Classification [11.525201208566925]
Bias Exploration via Overfitting (BEO) captures each distribution in greater detail by modeling it as a mixture of latent groups.
We introduce a fine-grained variant of CCDB, termed FG-CCDB, which performs more precise distribution matching and balancing within each group.
Our method performs on par with bias-supervised approaches on binary classification tasks and significantly outperforms them in highly biased multi-class scenarios.
arXiv Detail & Related papers (2025-05-11T04:01:34Z) - Deep Anti-Regularized Ensembles provide reliable out-of-distribution
uncertainty quantification [4.750521042508541]
Deep ensembles often return overconfident estimates outside the training domain.
We show that an ensemble of networks with large weights that still fit the training data is likely to meet these two objectives.
We derive a theoretical framework for this approach and show that the proposed optimization can be seen as a "water-filling" problem.
arXiv Detail & Related papers (2023-04-08T15:25:12Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent:
Convergence Guarantees and Empirical Benefits [21.353189917487512]
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for machine learning problems.
We take a step forward by proving minibatch SGD converges to a critical point of the full log-likelihood loss function.
Our theoretical guarantees hold provided that the kernel functions exhibit exponential or eigendecay.
arXiv Detail & Related papers (2021-11-19T22:28:47Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We make comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - Implicit Gradient Alignment in Distributed and Federated Learning [39.61762498388211]
A major obstacle to achieving global convergence in distributed and federated learning is misalignment of gradients across clients.
We propose a novel GradAlign algorithm that induces the same implicit regularization while allowing the use of arbitrarily large batches in each update.
arXiv Detail & Related papers (2021-06-25T22:01:35Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - Generalized Sliced Distances for Probability Distributions [47.543990188697734]
We introduce a broad family of probability metrics, coined Generalized Sliced Probability Metrics (GSPMs).
GSPMs are rooted in the generalized Radon transform and come with a unique geometric interpretation.
We consider GSPM-based gradient flows for generative modeling applications and show that under mild assumptions, the gradient flow converges to the global optimum.
arXiv Detail & Related papers (2020-02-28T04:18:00Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.