Towards an Understanding of Benign Overfitting in Neural Networks
- URL: http://arxiv.org/abs/2106.03212v1
- Date: Sun, 6 Jun 2021 19:08:53 GMT
- Title: Towards an Understanding of Benign Overfitting in Neural Networks
- Authors: Zhu Li, Zhi-Hua Zhou, Arthur Gretton
- Abstract summary: Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
- Score: 104.2956323934544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern machine learning models often employ a huge number of parameters and
are typically optimized to have zero training loss; yet surprisingly, they
possess near-optimal prediction performance, contradicting classical learning
theory. We examine how these benign overfitting phenomena occur in a two-layer
neural network setting where sample covariates are corrupted with noise. We
address the high dimensional regime, where the data dimension $d$ grows with
the number $n$ of data points. Our analysis combines an upper bound on the bias
with matching upper and lower bounds on the variance of the interpolator (an
estimator that interpolates the data). These results indicate that the excess
learning risk of the interpolator decays under mild conditions. We further show
that it is possible for the two-layer ReLU network interpolator to achieve a
near minimax-optimal learning rate, which to our knowledge is the first
generalization result for such networks. Finally, our theory predicts that the
excess learning risk starts to increase once the number of parameters $s$ grows
beyond $O(n^2)$, matching recent empirical findings.
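Before the related work, a rough illustration may help. The following numpy sketch is a minimal stand-in for the regime the abstract describes, not the paper's construction: a two-layer ReLU network whose hidden layer is frozen at random initialization and whose output layer is fit by least squares, which for width >= n is the minimum-norm interpolator with zero training loss. The data model, noise level, and widths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_features(X, W, b):
    """Random ReLU feature map phi(x) = max(Wx + b, 0)."""
    return np.maximum(X @ W.T + b, 0.0)

n, d, n_test, noise = 100, 20, 2000, 0.1
beta = rng.standard_normal(d) / np.sqrt(d)            # true linear signal
X = rng.standard_normal((n, d))
y = X @ beta + noise * rng.standard_normal(n)         # noisy training labels
X_te = rng.standard_normal((n_test, d))
y_te = X_te @ beta                                    # clean test targets

for width in [50, 200, 1000, 5000]:
    W = rng.standard_normal((width, d)) / np.sqrt(d)  # frozen hidden layer
    b = rng.uniform(-1.0, 1.0, width)
    Phi, Phi_te = relu_features(X, W, b), relu_features(X_te, W, b)
    # Least-squares fit of the output layer; for width >= n this is the
    # minimum-norm interpolator, i.e. zero training loss.
    theta = np.linalg.pinv(Phi) @ y
    risk = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"width={width:5d}  excess test risk={risk:.4f}")
```

In runs of this kind the test risk typically stays small as the width grows well past $n$, which is the benign behaviour that the paper's bias and variance bounds formalize.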
Related papers
- Deep learning from strongly mixing observations: Sparse-penalized regularization and minimax optimality [0.0]
We consider sparse-penalized regularization for a deep neural network predictor.
We deal with the squared loss and a broad class of other loss functions.
arXiv Detail & Related papers (2024-06-12T15:21:51Z)
- Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z)
- Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with a quadratic loss function, a fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
Our results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the "NTK regime".
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Multi-scale Feature Learning Dynamics: Insights for Double Descent [71.91871020059857]
We study the phenomenon of "double descent" of the generalization error.
We find that double descent can be attributed to distinct features being learned at different scales.
arXiv Detail & Related papers (2021-12-06T18:17:08Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by studying the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can make the two classes linearly separable with high probability (a minimal sketch follows this entry).
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z)
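The separation claim above can be illustrated in a few lines of numpy. The sketch below is an assumption-laden toy, not the paper's setup: two classes on concentric circles (not linearly separable in input space) pass through a random, untrained ReLU layer with standard Gaussian weights and uniform biases, and only a linear readout is fitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def two_circles(m):
    """Two classes on concentric circles; illustrative geometry only."""
    angle = rng.uniform(0.0, 2.0 * np.pi, m)
    r = np.where(np.arange(m) % 2 == 0, 1.0, 2.0)
    X = np.stack([r * np.cos(angle), r * np.sin(angle)], axis=1)
    y = np.where(r == 1.0, -1.0, 1.0)
    return X, y

X, y = two_circles(200)
width = 500
W = rng.standard_normal((width, 2))   # standard Gaussian weights
b = rng.uniform(-2.0, 2.0, width)     # uniformly distributed biases
Phi = np.maximum(X @ W.T + b, 0.0)    # random ReLU layer, never trained

# Fit only a linear readout on the random features and check separation.
w_out = np.linalg.lstsq(Phi, y, rcond=None)[0]
acc = np.mean(np.sign(Phi @ w_out) == y)
print(f"training accuracy of linear readout: {acc:.3f}")
```

If the random layer has made the classes linearly separable, the readout reaches accuracy 1.0 on the training set.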
- The Rate of Convergence of Variation-Constrained Deep Neural Networks [35.393855471751756]
We show that a class of variation-constrained neural networks can achieve a near-parametric rate $n^{-1/2+\delta}$ for an arbitrarily small constant $\delta$.
The result indicates that the neural function space needed for approximating smooth functions may not be as large as what is often perceived.
arXiv Detail & Related papers (2021-06-22T21:28:00Z)
- The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training [10.72393527290646]
We study memorization and generalization in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a "self-induced" term related to the high-degree components of the activation function (a schematic restatement follows this entry).
arXiv Detail & Related papers (2020-07-25T01:51:13Z)
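The chain of approximations in that entry can be restated schematically; $\lambda$ is the ridge parameter and $\gamma$ is assumed notation (not the paper's) for the self-induced term coming from the high-degree components of the activation:

```latex
% Schematic only: R denotes test risk, gamma the self-induced term.
\[
  R_{\mathrm{test}}\big(\hat f_{\mathrm{NN}}\big)
  \;\approx\; R_{\mathrm{KRR}}(\lambda)
  \;\approx\; R_{\mathrm{ridge}}(\lambda + \gamma),
  \qquad Nd \gg n .
\]
```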
- Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network [20.132432350255087]
Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data.
This paper studies how well overparametrized neural networks can recover the true target function in the presence of random noises.
arXiv Detail & Related papers (2020-07-06T01:02:23Z)
- A Revision of Neural Tangent Kernel-based Approaches for Neural Networks [34.75076385561115]
We use the neural tangent kernel to show that networks can fit any finite training sample perfectly.
A simple and analytic kernel function is derived that is indeed equivalent to a fully-trained network.
Our tighter analysis resolves the scaling problem and enables the validation of the original NTK-based results (a small sketch of the empirical NTK follows this entry).
arXiv Detail & Related papers (2020-07-02T05:07:55Z)
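To make the object in these NTK entries concrete, here is a small numpy sketch, with illustrative sizes and a two-layer ReLU architecture assumed for the example: it forms the empirical tangent kernel at random initialization and checks that kernel interpolation fits arbitrary finite labels exactly whenever the Gram matrix is positive definite.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 30, 10, 2000                 # samples, input dim, width

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)             # arbitrary finite labels

W = rng.standard_normal((m, d))        # hidden weights at initialization
a = rng.choice([-1.0, 1.0], m)         # fixed second-layer signs

# For f(x) = a^T relu(Wx) / sqrt(m), the gradient w.r.t. row w_j is
# a_j * 1{w_j^T x > 0} * x / sqrt(m), so the empirical NTK Gram matrix is
# K[i,k] = (sum_j a_j^2 1{gate_ij} 1{gate_kj}) * <x_i, x_k> / m.
gates = (X @ W.T > 0).astype(float)    # n x m ReLU activation patterns
K = ((gates * a) @ (gates * a).T) * (X @ X.T) / m

print("min eigenvalue of empirical NTK:", np.linalg.eigvalsh(K)[0])
alpha = np.linalg.solve(K, y)          # kernel interpolation of the labels
print("max training residual:", np.abs(K @ alpha - y).max())
```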