How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
- URL: http://arxiv.org/abs/2402.06323v3
- Date: Sat, 15 Feb 2025 15:21:10 GMT
- Title: How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
- Authors: Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry
- Abstract summary: We show that a 'flat' prior over the NN parameterization induces a rich prior over the NN functions.
This creates a bias towards simpler functions, which require fewer relevant parameters to represent.
- Score: 37.54951110709193
- License:
- Abstract: Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such a NN sample typically generalized as well as SGD-trained NNs. Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow "teacher NN" that agrees with the labels. Specifically, we show that such a 'flat' prior over the NN parameterization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent -- enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student's.
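The sampling procedure the abstract describes can be illustrated with a minimal "Guess & Check" sketch: draw student weights from an i.i.d. prior and accept only if the network interpolates the training labels. All names, sizes, and the Gaussian prior below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic task labeled by a narrow linear "teacher"
# (illustrative sizes; not the paper's experimental setting).
X = rng.standard_normal((8, 5))
teacher_w = rng.standard_normal(5)
y = np.sign(X @ teacher_w)

def sample_student(width=8):
    """Draw a one-hidden-layer ReLU student from an i.i.d. prior."""
    W1 = rng.standard_normal((5, width))
    w2 = rng.standard_normal(width)
    return W1, w2

def predict(params, X):
    W1, w2 = params
    return np.sign(np.maximum(X @ W1, 0.0) @ w2)

# "Guess & Check": resample until the random student interpolates the data.
params = sample_student()
while not np.all(predict(params, X) == y):
    params = sample_student()
```

The accepted sample is a draw from the prior conditioned on zero training error; the paper's claim is that such draws typically generalize when a narrow teacher exists.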
Related papers
- Randomness of Low-Layer Parameters Determines Confusing Samples in Terms of Interaction Representations of a DNN [67.80700786901016]
We find that the complexity of interactions encoded by a deep neural network (DNN) can explain its generalization power.
We also discover that the confusing samples of a DNN, which are represented by non-generalizable interactions, are determined by its low-layer parameters.
arXiv Detail & Related papers (2025-02-12T18:25:13Z) - Neural Redshift: Random Networks are not Random Functions [28.357640341268745]
We show that NNs do not have an inherent "simplicity bias".
Alternative architectures can be built with a bias for any level of complexity.
It points to promising avenues for controlling the solutions implemented by trained models.
arXiv Detail & Related papers (2024-03-04T17:33:20Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Do Neural Networks Generalize from Self-Averaging Sub-classifiers in the Same Way As Adaptive Boosting? [0.0]
Our research seeks to explain why neural networks generalize.
To our knowledge, we are the first authors to establish the connection between generalization in boosted classifiers and generalization in deep NNs.
Our experimental evidence and theoretical analysis suggest NNs trained with dropout exhibit similar self-averaging behavior over interpolating sub-classifiers as cited in popular explanations for the post-interpolation generalization phenomenon in boosting.
arXiv Detail & Related papers (2023-02-14T09:20:33Z) - Norm-based Generalization Bounds for Compositionally Sparse Neural Networks [11.987589603961622]
We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks.
Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.
arXiv Detail & Related papers (2023-01-28T00:06:22Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the "bias-generalized NTK".
We also study various properties of the neural networks with this new kernel.
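The NTK objects discussed in this and the following entries can be made concrete with a small finite-width sketch: the empirical NTK of a one-hidden-layer ReLU network is the Gram matrix of parameter gradients. The bias value and all sizes below are illustrative assumptions; this is a generic empirical NTK computation, not the paper's exact bias-generalized kernel.

```python
import numpy as np

rng = np.random.default_rng(1)

d, width = 3, 512
X = rng.standard_normal((4, d))

# One-hidden-layer ReLU net f(x) = (1/sqrt(width)) * sum_k a_k * relu(w_k . x + b)
W = rng.standard_normal((width, d))
a = rng.choice([-1.0, 1.0], size=width)
b = 2.0  # shared bias; "large bias" regime is only gestured at here

def grads(x):
    """Gradient of f at x w.r.t. the first-layer weights W (NTK features)."""
    pre = W @ x + b
    act = (pre > 0).astype(float)  # ReLU'(pre)
    return (a[:, None] * act[:, None] * x[None, :]) / np.sqrt(width)

# Empirical NTK: K[i, j] = <grad f(x_i), grad f(x_j)>
G = np.stack([grads(x).ravel() for x in X])
K = G @ G.T
```

By construction K is symmetric positive semi-definite, and as the width grows it concentrates around a deterministic limiting kernel, which is the object these NTK analyses study.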
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Extrapolation and Spectral Bias of Neural Nets with Hadamard Product: a Polynomial Net Study [55.12108376616355]
The study of the NTK has been devoted to typical neural network architectures, but is incomplete for neural networks with Hadamard products (NNs-Hp).
In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks.
We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of NTK.
arXiv Detail & Related papers (2022-09-16T06:36:06Z) - Deep Learning meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive? [16.105097124039602]
We study the theory of neural networks (NNs) through the lens of classical nonparametric regression problems.
Our research sheds new light on why depth matters and how NNs are more powerful than kernel methods.
arXiv Detail & Related papers (2022-04-20T17:55:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.