Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks
- URL: http://arxiv.org/abs/2006.09313v3
- Date: Sat, 22 May 2021 22:32:14 GMT
- Title: Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks
- Authors: Umut \c{S}im\c{s}ekli, Ozan Sener, George Deligiannidis, Murat A.
Erdogdu
- Abstract summary: We show that the trajectories of stochastic gradient descent (SGD) can be well-approximated by a Feller process.
We propose a "capacity metric", based on the tail-index of this process, to measure generalization performance.
- Score: 27.54155197562196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite its success in a wide range of applications, characterizing the
generalization properties of stochastic gradient descent (SGD) in non-convex
deep learning problems is still an important challenge. While modeling the
trajectories of SGD via stochastic differential equations (SDE) under
heavy-tailed gradient noise has recently shed light over several peculiar
characteristics of SGD, a rigorous treatment of the generalization properties
of such SDEs in a learning theoretical framework is still missing. Aiming to
bridge this gap, in this paper, we prove generalization bounds for SGD under
the assumption that its trajectories can be well-approximated by a \emph{Feller
process}, which defines a rich class of Markov processes that include several
recent SDE representations (both Brownian and heavy-tailed) as special cases.
We show that the generalization error can be controlled by the \emph{Hausdorff
dimension} of the trajectories, which is intimately linked to the tail behavior
of the driving process. Our results imply that heavier-tailed processes should
achieve better generalization; hence, the tail-index of the process can be used
as a notion of "capacity metric". We support our theory with experiments on
deep neural networks, illustrating that the proposed capacity metric accurately
estimates the generalization error and, unlike existing capacity metrics in the
literature, does not necessarily grow with the number of parameters.
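As a rough illustration of the "capacity metric" idea, the sketch below estimates a tail-index from a sample of gradient-noise norms using the classical Hill estimator. This is not the paper's own estimator; the variable names and the Cauchy stand-in data are hypothetical, chosen only to make the snippet self-contained. In the paper's framework, a smaller tail-index corresponds to heavier tails and hence, via the Hausdorff dimension of the trajectory, to better expected generalization.

```python
# Illustrative sketch (not the paper's exact estimator): estimate a tail-index
# from the norms of stochastic gradient noise with the classical Hill estimator.
import numpy as np

def hill_tail_index(samples: np.ndarray, k: int) -> float:
    """Hill estimator of the tail-index alpha from the k largest samples."""
    x = np.sort(np.abs(samples))[::-1]          # descending order statistics
    top = x[: k + 1]
    log_excess = np.log(top[:k]) - np.log(top[k])
    return 1.0 / np.mean(log_excess)            # alpha_hat = 1 / mean log-excess

# Hypothetical usage: `grad_noise_norms` would hold ||g_batch - g_full|| values
# collected while training a network with SGD. Heavy-tailed Cauchy noise stands
# in for real measurements here so the snippet runs on its own.
rng = np.random.default_rng(0)
grad_noise_norms = np.abs(rng.standard_cauchy(10_000))
alpha_hat = hill_tail_index(grad_noise_norms, k=500)
print(f"estimated tail-index: {alpha_hat:.2f}")  # close to 1 for Cauchy-like noise
```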
Related papers
- Generalization Bounds for Heavy-Tailed SDEs through the Fractional Fokker-Planck Equation [1.8416014644193066]
We prove high-probability generalization bounds for heavy-tailed SDEs that do not require nontrivial information-theoretic terms.
Our results suggest that heavy tails can be either beneficial or harmful depending on the problem structure.
arXiv Detail & Related papers (2024-02-12T15:35:32Z) - Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - The Heavy-Tail Phenomenon in SGD [7.366405857677226]
We show that depending on the structure of the Hessian of the loss at the minimum, the SGD iterates will converge to a heavy-tailed stationary distribution (an illustrative simulation sketch appears after this list).
We translate our results into insights about the behavior of SGD in deep learning.
arXiv Detail & Related papers (2020-06-08T16:43:56Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
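The stochastic-recurrence mechanism referenced in "The Heavy-Tail Phenomenon in SGD" entry above can be illustrated with a minimal, self-contained simulation: single-sample SGD on a 1-D least-squares problem with Gaussian data follows x_{k+1} = (1 - eta * a_k^2) x_k + eta * a_k * b_k, and larger step sizes push its stationary distribution toward heavier tails. All constants below (step sizes, iteration counts, the quantile-ratio diagnostic) are arbitrary choices for the demonstration, not values from the paper.

```python
# Minimal sketch of the Kesten-type recursion behind heavy tails in SGD:
# single-sample SGD on 0.5*(a*x - b)^2 with fresh Gaussian (a, b) each step.
import numpy as np

def sgd_least_squares(eta: float, n_steps: int = 200_000, burn_in: int = 10_000,
                      seed: int = 1) -> np.ndarray:
    """Return |x_k| after burn-in for SGD on a 1-D least-squares problem."""
    rng = np.random.default_rng(seed)
    x, trace = 0.0, []
    for k in range(n_steps):
        a, b = rng.standard_normal(2)
        x -= eta * a * (a * x - b)      # stochastic gradient step
        if k >= burn_in:
            trace.append(abs(x))
    return np.asarray(trace)

# Crude tail diagnostic: with heavy tails, extreme quantiles of |x| explode
# relative to the median; a larger step size makes the ratio much bigger.
for eta in (0.1, 0.7):
    t = sgd_least_squares(eta)
    q50, q999 = np.quantile(t, [0.5, 0.999])
    print(f"eta={eta}: 99.9th percentile / median of |x| = {q999 / q50:.1f}")
```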
This list is automatically generated from the titles and abstracts of the papers on this site.