From Tempered to Benign Overfitting in ReLU Neural Networks
- URL: http://arxiv.org/abs/2305.15141v3
- Date: Thu, 21 Mar 2024 10:15:19 GMT
- Title: From Tempered to Benign Overfitting in ReLU Neural Networks
- Authors: Guy Kornowski, Gilad Yehudai, Ohad Shamir
- Abstract summary: Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data.
Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting".
- Score: 41.271773069796126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the performance is non-optimal yet also non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has been lacking so far. In this work, we provide several results that aim at bridging these complementary views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data to benign in high dimensions. Thus, we show that the input dimension plays a crucial role in the type of overfitting in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture, and training algorithm on the one hand, and the type of resulting overfitting on the other hand.
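The dimension-driven transition described in the abstract lends itself to a small experiment. The sketch below is not the authors' code; the synthetic task, noise level, width, loss, and optimizer are all illustrative assumptions. It trains a wide 2-layer ReLU network to fit noisily labeled data at several input dimensions and reports the clean test error: tempered overfitting would show error tracking the noise level at small d, while benign overfitting would show near-zero error at large d.

```python
import torch
import torch.nn as nn

def run_trial(d, n=50, width=1000, noise=0.2, steps=3000, n_test=2000):
    """Fit a 2-layer ReLU net to n noisily labeled points in dimension d,
    then measure its error on clean test data (illustrative setup)."""
    torch.manual_seed(0)
    # Synthetic binary task: clean label = sign of the first coordinate;
    # a `noise` fraction of training labels is flipped.
    X = torch.randn(n, d)
    y = torch.sign(X[:, 0])
    flip = torch.rand(n) < noise
    y_train = torch.where(flip, -y, y)

    model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):  # train well past interpolation of the noisy labels
        opt.zero_grad()
        margin = y_train * model(X).squeeze(-1)
        loss = torch.clamp(1 - margin, min=0).mean()  # hinge loss
        loss.backward()
        opt.step()

    # Evaluate on fresh, clean test data from the same distribution.
    X_test = torch.randn(n_test, d)
    y_test = torch.sign(X_test[:, 0])
    with torch.no_grad():
        pred = torch.sign(model(X_test).squeeze(-1))
    return (pred != y_test).float().mean().item()

for d in [1, 5, 50, 500]:
    print(f"d={d:4d}  clean test error={run_trial(d):.3f}")
```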
Related papers
- Benign Overfitting in Single-Head Attention [27.297696573634976]
We study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers.
We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting after only two steps of gradient descent.
arXiv Detail & Related papers (2024-10-10T09:23:33Z)
- A PAC-Bayesian Perspective on the Interpolating Information Criterion [54.548058449535155]
We show how a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence performance in the interpolating regime.
We quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by, e.g., the combination of model and parameter-initialization scheme.
arXiv Detail & Related papers (2023-11-13T01:48:08Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension [17.96183484063563]
We show that the smoothness of the estimators, and not the dimension, is the key to benign overfitting.
We translate our results to wide neural networks.
Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets.
arXiv Detail & Related papers (2023-05-23T13:56:29Z)
- On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training when it is subject to uncertainty in the gradient estimate.
We show that the general overparameterized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z)
- Robustness to Pruning Predicts Generalization in Deep Neural Networks [29.660568281957072]
We introduce prunability: the smallest fraction of a network's parameters that can be kept while pruning without adversely affecting its training loss (see the illustrative sketch after this list).
We show that this measure is highly predictive of a model's generalization performance across a large set of convolutional networks trained on CIFAR-10.
arXiv Detail & Related papers (2021-03-10T11:39:14Z)
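As a concrete illustration of the prunability measure above, here is a hedged sketch that estimates it via global magnitude pruning; the loss, tolerance, and grid search are assumptions made for illustration, not the paper's exact procedure.

```python
import copy
import torch
import torch.nn.functional as F

def prunability(model, X, y, tol=0.01, grid=20):
    """Estimate prunability: the smallest fraction of weights kept (here, by
    global magnitude) such that the pruned copy's training loss stays within
    `tol` of the original. Tolerance and grid are illustrative choices."""
    with torch.no_grad():
        base = F.cross_entropy(model(X), y).item()
        all_w = torch.cat([p.abs().flatten() for p in model.parameters()])
        for keep in [i / grid for i in range(1, grid + 1)]:
            cutoff = torch.quantile(all_w, 1.0 - keep)
            pruned = copy.deepcopy(model)
            for p in pruned.parameters():
                p.mul_((p.abs() >= cutoff).float())  # zero the smallest weights
            if F.cross_entropy(pruned(X), y).item() <= base + tol:
                return keep
    return 1.0

# Hypothetical usage on a small classifier and synthetic data:
torch.manual_seed(0)
X = torch.randn(200, 10)
y = (X[:, 0] > 0).long()
net = torch.nn.Sequential(torch.nn.Linear(10, 100), torch.nn.ReLU(),
                          torch.nn.Linear(100, 2))
# Per the paper, models that tolerate more pruning (smaller kept fraction)
# tend to generalize better.
print(prunability(net, X, y))
```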