Neural Networks and Polynomial Regression. Demystifying the
Overparametrization Phenomena
- URL: http://arxiv.org/abs/2003.10523v1
- Date: Mon, 23 Mar 2020 20:09:31 GMT
- Title: Neural Networks and Polynomial Regression. Demystifying the
Overparametrization Phenomena
- Authors: Matt Emschwiller, David Gamarnik, Eren C. Kızıldağ, Ilias Zadik
- Abstract summary: In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data.
A conventional explanation of this phenomenon is based on self-regularization properties of the algorithms used to train the model.
We show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by the data dimension.
- Score: 17.205106391379026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the context of neural network models, overparametrization refers to the
phenomenon whereby these models appear to generalize well on unseen data,
even though the number of parameters significantly exceeds the sample size,
and the model perfectly fits the training data. A conventional explanation
of this phenomenon is based on self-regularization properties of the algorithms
used to train the model. In this paper we prove a series of results which provide a
somewhat diverging explanation. Adopting a teacher/student model where the
teacher network is used to generate the predictions and the student network is
trained on the observed labeled data, and then tested on out-of-sample data, we
show that any student network interpolating the data generated by a teacher
network generalizes well, provided that the sample size is at least an explicit
quantity controlled by data dimension and approximation guarantee alone,
regardless of the number of internal nodes of either teacher or student
network.
Our claim is based on approximating both teacher and student networks by
polynomial (tensor) regression models with degree depending on the desired
accuracy and network depth only. Such a parametrization notably does not depend
on the number of internal nodes. Thus a message implied by our results is that
parametrizing wide neural networks by the number of hidden nodes is misleading,
and a more fitting measure of parametrization complexity is the number of
regression coefficients associated with tensorized data. In particular, this
somewhat reconciles the generalization ability of neural networks with more
classical statistical notions of data complexity and generalization bounds. Our
empirical results on the MNIST and Fashion-MNIST datasets indeed confirm that
tensorized regression achieves good out-of-sample performance, even when the
degree of the tensor is at most two.
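The tensorized-regression idea above can be illustrated with a minimal sketch: labels come from a random "teacher" network, and a "student" fits ordinary least squares on degree-≤2 polynomial (tensor) features of the inputs, whose coefficient count (1 + d + d(d+1)/2) depends only on the data dimension d, not on the teacher's width. This is an illustrative toy on synthetic Gaussian data, not the paper's exact experimental setup; all names and sizes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def degree2_features(X):
    """Map each input x to its degree-<=2 tensorized features:
    [1, x_i, x_i * x_j for i <= j]."""
    n, d = X.shape
    iu = np.triu_indices(d)  # index pairs (i, j) with i <= j
    quad = (X[:, :, None] * X[:, None, :])[:, iu[0], iu[1]]
    return np.hstack([np.ones((n, 1)), X, quad])

# A small random "teacher" network with tanh activations generates the labels.
d, width, n_train, n_test = 5, 50, 2000, 500
W1 = rng.normal(size=(d, width)) / np.sqrt(d)
w2 = rng.normal(size=width) / np.sqrt(width)
teacher = lambda X: np.tanh(X @ W1) @ w2

X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train, y_test = teacher(X_train), teacher(X_test)

# "Student": least-squares regression on degree-2 tensorized features.
# The number of coefficients is 1 + d + d*(d+1)/2, independent of `width`.
Phi_train = degree2_features(X_train)
coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
pred = degree2_features(X_test) @ coef

mse = np.mean((pred - y_test) ** 2)
baseline = np.mean((y_test - y_train.mean()) ** 2)
print(f"test MSE {mse:.4f} vs constant-predictor baseline {baseline:.4f}")
```

With d = 5 the student has only 21 coefficients, yet its out-of-sample error is well below the constant-predictor baseline regardless of the teacher's 50 hidden nodes, in the spirit of the paper's claim that regression coefficients over tensorized data, not hidden-node counts, are the fitting parametrization.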
Related papers
- Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
- Neural networks trained with SGD learn distributions of increasing complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics, exploiting higher-order statistics only later during training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z)
- Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural Isometry and Exact Recovery [33.74925020397343]
Deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters.
We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization.
We show that ReLU networks learn simple and sparse models even when the labels are noisy.
arXiv Detail & Related papers (2022-09-30T06:47:15Z)
- Robust Generalization of Quadratic Neural Networks via Function Identification [19.87036824512198]
Generalization bounds from learning theory often assume that the test distribution is close to the training distribution.
We show that for quadratic neural networks, we can identify the function represented by the model even though we cannot identify its parameters.
arXiv Detail & Related papers (2021-09-22T18:02:00Z)
- The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can separate two classes of data with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z)
- Slope and generalization properties of neural networks [0.0]
We show that the distribution of the slope of a well-trained neural network classifier is generally independent of the width of the layers in a fully connected network.
The slope is of similar size throughout the relevant volume, and varies smoothly. It also behaves as predicted in rescaling examples.
We discuss possible applications of the slope concept, such as using it as a part of the loss function or stopping criterion during network training, or ranking data sets in terms of their complexity.
arXiv Detail & Related papers (2021-07-03T17:54:27Z)
- Redundant representations help generalization in wide neural networks [71.38860635025907]
We study the last hidden layer representations of various state-of-the-art convolutional neural networks.
We find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise.
arXiv Detail & Related papers (2021-06-07T10:18:54Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- More data or more parameters? Investigating the effect of data structure on generalization [17.249712222764085]
Properties of data impact the test error as a function of the number of training examples and number of training parameters.
We show that noise in the labels and strong anisotropy of the input data play similar roles on the test error.
arXiv Detail & Related papers (2021-03-09T16:08:41Z)
- Synthesizing Irreproducibility in Deep Networks [2.28438857884398]
Modern deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification).
We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs.
Model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible.
arXiv Detail & Related papers (2021-02-21T21:51:28Z)
- Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning.
We propose a novel method of using data augmentations when training autoencoders.
We train a Variational Autoencoder in such a way that the transformation outcome is predictable by an auxiliary network.
arXiv Detail & Related papers (2020-10-10T14:04:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.