Time-Independent Information-Theoretic Generalization Bounds for SGLD
- URL: http://arxiv.org/abs/2311.01046v1
- Date: Thu, 2 Nov 2023 07:42:23 GMT
- Title: Time-Independent Information-Theoretic Generalization Bounds for SGLD
- Authors: Futoshi Futami, Masahiro Fujisawa
- Abstract summary: We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD).
Our bounds are based on the assumptions of smoothness and dissipativity, and are time-independent.
- Score: 4.73194777046253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We provide novel information-theoretic generalization bounds for stochastic
gradient Langevin dynamics (SGLD) under the assumptions of smoothness and
dissipativity, which are widely used in sampling and non-convex optimization
studies. Our bounds are time-independent and decay to zero as the sample size
increases, regardless of the number of iterations and whether the step size is
fixed. Unlike previous studies, we derive the generalization error bounds by
focusing on the time evolution of the Kullback--Leibler divergence, which is
related to the stability of datasets and is the upper bound of the mutual
information between output parameters and an input dataset. Additionally, we
establish the first information-theoretic generalization bound when the
training and test loss are the same by showing that a loss function of SGLD is
sub-exponential. This bound is also time-independent and removes the
problematic step size dependence in existing work, leading to an improved
excess risk bound by combining our analysis with the existing non-convex
optimization error bounds.
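For context, the abstract's mutual-information route can be compared with the standard sub-Gaussian bound from prior information-theoretic work (not this paper's time-independent result), in which the expected generalization gap of the output parameters W trained on an n-sample dataset S satisfies, for a sigma-sub-Gaussian loss,

  |E[gen(W, S)]| <= sqrt( 2 * sigma^2 * I(W; S) / n ),

and the KL-divergence analysis described above upper-bounds the mutual information term I(W; S) for SGLD. Below is a minimal, illustrative sketch of the SGLD update that such bounds analyze, applied to a toy L2-regularized least-squares objective (a smooth and dissipative loss); the toy objective, variable names, and hyperparameters are assumptions for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, minibatch_grad, step_size=1e-2, beta=10.0):
    """One SGLD update: a stochastic-gradient step on a minibatch plus
    Gaussian noise scaled by the step size and the inverse temperature beta."""
    noise = rng.normal(size=theta.shape)
    return theta - step_size * minibatch_grad + np.sqrt(2.0 * step_size / beta) * noise

# Toy objective: L2-regularized least squares, a smooth and dissipative loss.
n, d, batch = 256, 5, 32
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + 0.1 * rng.normal(size=n)
theta, lam = np.zeros(d), 0.1

for _ in range(1000):
    idx = rng.choice(n, size=batch, replace=False)        # random minibatch
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch + lam * theta
    theta = sgld_step(theta, grad)

print("final parameters:", np.round(theta, 2))
```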
Related papers
- Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework [56.82432591933544]
Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning.
This paper analyzes the run time and staleness of distributed SGD based on stochastic delay differential equations (SDDEs) and an approximation of gradient arrivals.
Interestingly, it is shown that increasing the number of activated workers does not necessarily accelerate distributed SGD because of staleness (a toy stale-gradient sketch appears after this list).
arXiv Detail & Related papers (2024-06-17T02:56:55Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - Understanding the Generalization Ability of Deep Learning Algorithms: A Kernelized Renyi's Entropy Perspective [11.255943520955764]
We propose a novel information-theoretic measure: kernelized Renyi's entropy.
We establish generalization error bounds for stochastic gradient descent/Langevin dynamics (SGD/SGLD) learning algorithms under kernelized Renyi's entropy.
We show that our information-theoretical bounds depend on the statistics of the gradients, and are rigorously tighter than the current state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-05-02T01:17:15Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Time-independent Generalization Bounds for SGLD in Non-convex Settings [23.833787505938858]
We establish generalization error bounds for stochastic gradient Langevin dynamics (SGLD) with constant learning rate under the assumptions of dissipativity and smoothness.
Our analysis also supports variants that use different discretization methods or non-isotropic noise.
arXiv Detail & Related papers (2021-11-25T02:31:52Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution.
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z) - Online stochastic gradient descent on non-convex losses from high-dimensional inference [2.2344764434954256]
Stochastic gradient descent (SGD) is a popular algorithm for optimization problems in high-dimensional tasks.
In this paper, we study when online SGD produces an estimator with non-trivial correlation with the underlying parameter.
We illustrate our approach by applying it to a set of tasks such as phase retrieval and parameter estimation for generalized linear models.
arXiv Detail & Related papers (2020-03-23T17:34:06Z)
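To make the staleness point from the first related paper above (Distributed Stochastic Gradient Descent with Staleness) concrete, here is a hypothetical toy sketch of stale-gradient descent on a simple quadratic objective: each update uses a gradient evaluated at parameters a fixed number of steps old, mimicking workers that report results computed on an outdated model. The function name, objective, and step size are illustrative assumptions and do not reproduce that paper's SDDE framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def stale_gd(delay=0, num_steps=200, step_size=0.35, dim=10):
    """Toy stale-gradient descent on f(w) = 0.5 * ||w||^2 (its gradient is w).
    Each update uses the gradient at parameters `delay` steps old, mimicking
    a worker that computed its gradient on an outdated model."""
    w = rng.normal(size=dim)
    history = [w.copy()]  # past iterates, used to emulate stale workers
    for _ in range(num_steps):
        stale_w = history[max(0, len(history) - 1 - delay)]
        w = w - step_size * stale_w  # gradient of the quadratic at the stale point
        history.append(w.copy())
    return np.linalg.norm(w)

# More staleness slows convergence and, past a stability threshold, diverges.
for delay in (0, 2, 5):
    print(f"delay={delay}: final ||w|| = {stale_gd(delay=delay):.2e}")
```

With these toy settings, a small delay already slows convergence and a larger one destabilizes the iteration, which is consistent with the observation that adding more (stale) workers need not speed up distributed SGD.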