Do highly over-parameterized neural networks generalize since bad
solutions are rare?
- URL: http://arxiv.org/abs/2211.03570v4
- Date: Sun, 3 Dec 2023 13:50:19 GMT
- Title: Do highly over-parameterized neural networks generalize since bad
solutions are rare?
- Authors: Julius Martinetz, Thomas Martinetz
- Abstract summary: Empirical Risk Minimization (ERM) for learning leads to zero training error.
We show that under certain conditions the fraction of "bad" global minima with a true error larger than ε decays to zero exponentially fast with the number of training data n.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study over-parameterized classifiers where Empirical Risk Minimization
(ERM) for learning leads to zero training error. In these over-parameterized
settings there are many global minima with zero training error, some of which
generalize better than others. We show that under certain conditions the
fraction of "bad" global minima with a true error larger than ε decays
to zero exponentially fast with the number of training data n. The bound
depends on the distribution of the true error over the set of classifier
functions used for the given classification problem, and does not necessarily
depend on the size or complexity (e.g. the number of parameters) of the
classifier function set. This insight may provide a novel perspective on the
unexpectedly good generalization even of highly over-parameterized neural
networks. We substantiate our theoretical findings through experiments on
synthetic data and a subset of MNIST. Additionally, we assess our hypothesis
using VGG19 and ResNet18 on a subset of Caltech101.
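The abstract's central claim — that among zero-training-error classifiers the fraction with true error above ε shrinks exponentially in n — can be illustrated with a toy experiment. The sketch below is not from the paper; it uses a hypothetical hypothesis set of one-dimensional threshold classifiers, chosen because the true error of each hypothesis is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hypothesis set: threshold classifiers h_t(x) = sign(x - t).
# Data: x ~ Uniform(-1, 1) with true label y = sign(x), so the true error
# of h_t is |t| / 2 (the probability that x falls between 0 and t).
thresholds = rng.uniform(-1, 1, size=5000)
true_err = np.abs(thresholds) / 2
eps = 0.05

def bad_fraction(n, trials=200):
    """Among zero-training-error thresholds, mean fraction with true error > eps."""
    fracs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, size=n)
        y = np.sign(x)
        # h_t has zero training error iff sign(x - t) == y for every sample.
        preds = np.sign(x[None, :] - thresholds[:, None])
        zero_train = (preds == y[None, :]).all(axis=1)
        if zero_train.any():
            fracs.append((true_err[zero_train] > eps).mean())
    return float(np.mean(fracs))

# The "bad" fraction among zero-training-error classifiers shrinks as n grows.
for n in (5, 20, 80):
    print(n, bad_fraction(n))
```

In this toy setting a threshold t survives n samples with zero training error with probability (1 − |t|/2)^n, so hypotheses with true error above ε are suppressed by at least (1 − ε)^n relative to good ones — the same exponential decay the abstract describes, without reference to the size of the hypothesis set.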
Related papers
- Rethinking generalization of classifiers in separable classes scenarios and over-parameterized regimes [0.0]
We show that in separable classes scenarios the proportion of "bad" global minima diminishes exponentially with the number of training data n.
We propose a model for the density distribution of the true error, yielding learning curves that align with experiments on MNIST and CIFAR-10.
arXiv Detail & Related papers (2024-10-22T10:12:57Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, solutions are found only through the training procedure, and its components (the gradient-based optimizer and regularizers) limit this flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural Isometry and Exact Recovery [33.74925020397343]
Deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters.
We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization.
We show that ReLU networks learn simple and sparse models even when the labels are noisy.
arXiv Detail & Related papers (2022-09-30T06:47:15Z)
- Predicting Unreliable Predictions by Shattering a Neural Network [145.3823991041987]
Piecewise linear neural networks can be split into subfunctions.
Subfunctions have their own activation pattern, domain, and empirical error.
Empirical error for the full network can be written as an expectation over subfunctions.
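The decomposition stated above can be checked numerically: since activation patterns partition the samples into disjoint regions, the region-probability-weighted average of per-region errors must equal the overall empirical error. A minimal sketch with a hypothetical tiny ReLU network and random weights (not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tiny ReLU network 2 -> 4 -> 1 with random weights.
W1 = rng.normal(size=(4, 2))
b1 = rng.normal(size=4)
w2 = rng.normal(size=4)

def forward(X):
    h = np.maximum(W1 @ X.T + b1[:, None], 0.0)
    return w2 @ h, (h > 0)          # outputs and per-sample activation patterns

X = rng.normal(size=(1000, 2))
y = np.sign(X[:, 0] * X[:, 1])      # arbitrary labels for illustration
out, patterns = forward(X)
err = (np.sign(out) != y).astype(float)

# Each activation pattern selects one linear subfunction; group samples by it.
keys = [p.tobytes() for p in patterns.T]
weighted = 0.0
for k in set(keys):
    mask = np.array([ki == k for ki in keys])
    weighted += mask.mean() * err[mask].mean()   # P(region) * error in region

print(abs(err.mean() - weighted))   # ~0: full error = expectation over regions
```

The identity holds exactly (up to floating-point rounding) because the groups are disjoint and exhaustive, so the weighted sum telescopes back to the overall mean.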
arXiv Detail & Related papers (2021-06-15T18:34:41Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks [16.64116123743938]
We consider the problem of finding a two-layer neural network with sigmoid or rectified linear unit (ReLU) activations.
We then leverage our bounds to establish guarantees for such networks via the fat-shattering dimension.
Notably, our bounds also have good sample complexity (low-degree polynomials in $d$).
arXiv Detail & Related papers (2021-03-02T17:36:03Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value ε*, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.