On generalization bounds for deep networks based on loss surface
implicit regularization
- URL: http://arxiv.org/abs/2201.04545v1
- Date: Wed, 12 Jan 2022 16:41:34 GMT
- Title: On generalization bounds for deep networks based on loss surface
implicit regularization
- Authors: Masaaki Imaizumi, Johannes Schmidt-Hieber
- Abstract summary: Modern deep neural networks generalize well despite a large number of parameters, which contradicts classical statistical learning theory.
- Score: 5.68558935178946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite their large number of parameters contradicts this finding and constitutes a major unsolved problem in explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) is regarded as important, but its specific principle is still unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low-dimensional subspace and that this induces implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property on the population risk. Under these conditions, we derive lower bounds on the probability that SGD remains in these stagnation sets. If stagnation occurs, we derive a bound on the generalization error of deep neural networks that involves the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs rely on controlling the change of parameter values across the SGD iterates and on local uniform convergence of the empirical loss functions, based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis with uniform convergence.
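The abstract studies SGD perturbed by Gaussian gradient noise and a generalization bound that involves the spectral norms of the weight matrices rather than the parameter count. As a rough, self-contained illustration only (not the authors' construction; the toy two-layer model, step size, noise level, and all names below are hypothetical), the following NumPy sketch runs noisy SGD on a small ReLU network and reports the product of the layers' spectral norms, the kind of capacity term such bounds involve.

```python
# Illustrative sketch only: noisy SGD on a toy two-layer ReLU regression
# problem, plus the product of the spectral norms of the weight matrices,
# i.e. the kind of capacity term spectral-norm generalization bounds use.
# Architecture, step size, and noise level are hypothetical, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples, d features, scalar targets.
n, d, width = 200, 10, 32
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])  # arbitrary smooth target

# Two-layer network parameters.
W1 = rng.normal(scale=0.1, size=(width, d))
W2 = rng.normal(scale=0.1, size=(1, width))

def forward(X, W1, W2):
    H = np.maximum(X @ W1.T, 0.0)      # ReLU hidden layer
    return (H @ W2.T).ravel(), H

def empirical_risk(X, y, W1, W2):
    pred, _ = forward(X, W1, W2)
    return 0.5 * np.mean((pred - y) ** 2)

lr, noise_std, batch = 0.05, 0.01, 32
for step in range(2000):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    pred, H = forward(Xb, W1, W2)
    err = (pred - yb) / batch          # d(loss)/d(prediction)
    # Backpropagated gradients of the squared loss.
    gW2 = err[None, :] @ H
    gH = (err[:, None] * W2) * (H > 0)  # ReLU derivative
    gW1 = gH.T @ Xb
    # SGD update with additive Gaussian gradient noise,
    # mimicking the noise model discussed in the abstract.
    W2 -= lr * (gW2 + noise_std * rng.normal(size=gW2.shape))
    W1 -= lr * (gW1 + noise_std * rng.normal(size=gW1.shape))

# Capacity term of the type appearing in spectral-norm bounds:
# the product of the largest singular values of the weight matrices.
spectral_product = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)
print(f"final empirical risk: {empirical_risk(X, y, W1, W2):.4f}")
print(f"product of spectral norms: {spectral_product:.4f}")
```

The printed spectral-norm product is only a proxy for the complexity measure in such bounds; the paper's actual analysis additionally relies on stagnation sets and local essential convexity, which this sketch does not implement.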
Related papers
- Generalization of Scaled Deep ResNets in the Mean-Field Regime [55.77054255101667]
We investigate scaled ResNet in the limit of infinitely deep and wide neural networks.
Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime.
arXiv Detail & Related papers (2024-03-14T21:48:00Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and for both we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Generalization Error Bounds for Deep Neural Networks Trained by SGD [3.148524502470734]
Generalization error bounds for deep neural networks trained by stochastic gradient descent (SGD) are derived.
The bounds explicitly depend on the loss along the training trajectory.
Results show that our bounds are non-vacuous and robust to changes in network architecture and hyperparameters.
arXiv Detail & Related papers (2022-06-07T13:46:10Z) - Robust Estimation for Nonparametric Families via Generative Adversarial
Networks [92.64483100338724]
We provide a framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems.
Our work extends these to robust mean estimation, second moment estimation, and robust linear regression.
In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance.
arXiv Detail & Related papers (2022-02-02T20:11:33Z) - Global convergence of ResNets: From finite to infinite width using
linear parameterization [0.0]
We study Residual Networks (ResNets) in which the residual block has a linear parameterization while still being nonlinear.
In this limit, we prove a local Polyak-Lojasiewicz inequality, retrieving the lazy regime.
Our analysis leads to a practical and quantified recipe.
arXiv Detail & Related papers (2021-12-10T13:38:08Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Why Flatness Correlates With Generalization For Deep Neural Networks [0.0]
We argue that local flatness measures correlate with generalization because they are local approximations to a global property, the volume of parameters mapping to a given function.
For functions that give zero error on a test set, this volume is directly proportional to the Bayesian posterior.
Some variants of SGD can break the flatness-generalization correlation, while the volume-generalization correlation remains intact.
arXiv Detail & Related papers (2021-03-10T17:44:52Z) - Explicit regularization and implicit bias in deep network classifiers
trained with the square loss [2.8935588665357077]
Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks.
We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques are used together with weight decay.
arXiv Detail & Related papers (2020-12-31T21:07:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.