Information-Theoretic Generalization Bounds for Stochastic Gradient
Descent
- URL: http://arxiv.org/abs/2102.00931v1
- Date: Mon, 1 Feb 2021 16:00:34 GMT
- Title: Information-Theoretic Generalization Bounds for Stochastic Gradient
Descent
- Authors: Gergely Neu
- Abstract summary: We provide bounds on the generalization error that depend on local statistics of the stochastic gradients along the SGD path.
Key factors are the variance of the gradients with respect to the data distribution, the local smoothness of the objective function along the SGD path, and the sensitivity of the loss function to perturbations of the final output.
Our key technical tool combines the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.
- Score: 13.757095663704858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the generalization properties of the popular stochastic gradient
descent method for optimizing general non-convex loss functions. Our main
contribution is providing upper bounds on the generalization error that depend
on local statistics of the stochastic gradients evaluated along the path of
iterates calculated by SGD. The key factors our bounds depend on are the
variance of the gradients (with respect to the data distribution), the local
smoothness of the objective function along the SGD path, and the sensitivity of
the loss function to perturbations to the final output. Our key technical tool
is combining the information-theoretic generalization bounds previously used
for analyzing randomized variants of SGD with a perturbation analysis of the
iterates.
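To make the quantities in the abstract concrete, the following is a minimal, hypothetical Python sketch (not the paper's construction or its bound): it runs plain SGD on a small non-convex least-squares problem and records, at each iterate, empirical proxies for two of the local statistics the bounds depend on, namely the variance of the stochastic gradients and the local smoothness of the objective along the SGD path. The toy model, step size, and finite-difference smoothness probe are illustrative assumptions.

```python
# Illustrative sketch only: track local gradient statistics along an SGD path.
# The model, data, and smoothness probe are assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = tanh(x . w*) + noise, fit with the non-convex model f(x; w) = tanh(x . w).
n, d = 200, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.tanh(X @ w_star) + 0.1 * rng.normal(size=n)

def per_example_grads(w, Xb, yb):
    """Gradients of 0.5 * (tanh(x.w) - y)^2 for each example in the batch."""
    z = Xb @ w
    resid = np.tanh(z) - yb
    return (resid * (1.0 - np.tanh(z) ** 2))[:, None] * Xb  # shape (batch, d)

def objective(w, Xb, yb):
    return 0.5 * np.mean((np.tanh(Xb @ w) - yb) ** 2)

w = np.zeros(d)
eta, T, batch = 0.1, 200, 16
grad_variance, local_smoothness = [], []

for t in range(T):
    idx = rng.choice(n, size=batch, replace=False)
    G = per_example_grads(w, X[idx], y[idx])
    g = G.mean(axis=0)

    # Empirical variance of the stochastic gradients at the current iterate.
    grad_variance.append(np.mean(np.sum((G - g) ** 2, axis=1)))

    # Crude local-smoothness proxy: gradient change under a small random perturbation.
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    eps = 1e-3
    g_pert = per_example_grads(w + eps * u, X[idx], y[idx]).mean(axis=0)
    local_smoothness.append(np.linalg.norm(g_pert - g) / eps)

    w -= eta * g  # SGD update

print(f"final loss {objective(w, X, y):.4f}, "
      f"mean gradient variance {np.mean(grad_variance):.4f}, "
      f"mean smoothness proxy {np.mean(local_smoothness):.2f}")
```

In the paper's terminology, trajectories with small gradient variance and mild local smoothness would make such statistics, and hence the resulting bounds, small; the sketch only illustrates how these quantities can be measured along the path.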
Related papers
- Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression [4.150180443030652]
We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm.
The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer.
arXiv Detail & Related papers (2024-10-03T16:13:42Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Stability vs Implicit Bias of Gradient Methods on Separable Data and
Beyond [33.593203156666746]
We focus on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification.
We give an additional unified explanation for this generalization, which we refer to as realizability and self-boundedness.
In some of these cases, our bounds significantly improve upon the existing generalization error bounds in the literature.
arXiv Detail & Related papers (2022-02-27T19:56:36Z) - Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector
Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - The Sobolev Regularization Effect of Stochastic Gradient Descent [8.193914488276468]
We show that flat minima regularize the gradient of the model function, which explains the good performance of flat minima.
We also consider high-order moments of gradient noise, and show that stochastic gradient descent (SGD) tends to impose constraints on these moments by a linear analysis of SGD around global minima.
arXiv Detail & Related papers (2021-05-27T21:49:21Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.