Regularization Matters: A Nonparametric Perspective on Overparametrized
Neural Network
- URL: http://arxiv.org/abs/2007.02486v2
- Date: Sat, 25 Sep 2021 12:42:26 GMT
- Title: Regularization Matters: A Nonparametric Perspective on Overparametrized
Neural Network
- Authors: Tianyang Hu, Wenjia Wang, Cong Lin, Guang Cheng
- Abstract summary: Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data.
This paper studies how well overparametrized neural networks can recover the true target function in the presence of random noises.
- Score: 20.132432350255087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overparametrized neural networks trained by gradient descent (GD) can
provably overfit any training data. However, the generalization guarantee may
not hold for noisy data. From a nonparametric perspective, this paper studies
how well overparametrized neural networks can recover the true target function
in the presence of random noises. We establish a lower bound on the $L_2$
estimation error with respect to the GD iterations, which is away from zero
without a delicate scheme of early stopping. In turn, through a comprehensive
analysis of $\ell_2$-regularized GD trajectories, we prove that for
overparametrized one-hidden-layer ReLU neural network with the $\ell_2$
regularization: (1) the output is close to that of the kernel ridge regression
with the corresponding neural tangent kernel; (2) minimax optimal rate of
$L_2$ estimation error can be achieved. Numerical experiments confirm our
theory and further demonstrate that the $\ell_2$ regularization approach
improves the training robustness and works for a wider range of neural
networks.
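Claim (1) of the abstract can be illustrated in a simplified, linearized setting. The sketch below is not the authors' experimental setup: it assumes the network is replaced by its first-order expansion around initialization, that the initial output is zero, and that only the hidden-layer weights move. Under those assumptions, gradient descent with an $\ell_2$ penalty on the weight displacement converges to the kernel ridge regression predictor built from the empirical neural tangent kernel. The function names (`relu_ntk_features`, `regularized_gd`, `kernel_ridge_fit`) and all sizes and constants are hypothetical choices made for illustration.

```python
import numpy as np

def relu_ntk_features(X, W, a):
    """Gradient of f(x) = a^T relu(W x) / sqrt(m) w.r.t. W, flattened per sample."""
    m = W.shape[0]
    active = (X @ W.T >= 0).astype(float)          # (n, m) ReLU activation pattern
    # d f / d w_r = a_r * 1{w_r . x >= 0} * x / sqrt(m)
    feats = (active * a)[:, :, None] * X[:, None, :] / np.sqrt(m)
    return feats.reshape(X.shape[0], -1)           # (n, m * d)

def regularized_gd(Phi, y, mu, n_steps):
    """GD on (1/2n)||Phi theta - y||^2 + (mu/2)||theta||^2, starting from theta = 0."""
    n = Phi.shape[0]
    # Step size 1/L, where L is the largest Hessian eigenvalue of the objective.
    step = 1.0 / (np.linalg.eigvalsh(Phi @ Phi.T)[-1] / n + mu)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ theta - y) / n + mu * theta
        theta -= step * grad
    return theta

def kernel_ridge_fit(K, y, mu):
    """Dual coefficients of kernel ridge regression with penalty n * mu."""
    n = K.shape[0]
    return np.linalg.solve(K + n * mu * np.eye(n), y)

# Synthetic noisy data on the unit sphere (assumed setup, for illustration only).
rng = np.random.default_rng(0)
n, d, m, mu = 40, 3, 200, 0.05
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=n)

# Random initialization of a width-m one-hidden-layer ReLU network.
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

Phi = relu_ntk_features(X, W, a)
K = Phi @ Phi.T                                    # empirical NTK Gram matrix

gd_pred = Phi @ regularized_gd(Phi, y, mu, n_steps=2000)
krr_pred = K @ kernel_ridge_fit(K, y, mu)
max_gap = np.max(np.abs(gd_pred - krr_pred))
print("max |GD prediction - KRR prediction| on training inputs:", max_gap)
```

In this linearized model the agreement is an algebraic identity, so the gap shrinks to numerical precision once GD has converged. The paper's result concerns the actual nonlinear network, where the two outputs are only close and the gap depends on the width.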
Related papers
- Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression: A Distribution-Free Analysis [19.988762532185884]
We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of $\mathcal{O}(\varepsilon_n^2)$.
It is remarked that our result does not require distributional assumptions on the training data.
arXiv Detail & Related papers (2024-11-05T08:43:54Z)
- Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [8.130817534654089]
We consider nonparametric regression by a two-layer neural network trained by gradient descent (GD) or its variant in this paper.
We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$.
arXiv Detail & Related papers (2024-07-16T03:38:34Z)
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z)
- Provable Identifiability of Two-Layer ReLU Neural Networks via LASSO Regularization [15.517787031620864]
The territory of LASSO is extended to two-layer ReLU neural networks, a fashionable and powerful nonlinear regression model.
We show that the LASSO estimator can stably reconstruct the neural network and identify $\mathcal{S}^{\star}$ when the number of samples scales logarithmically.
Our theory lies in an extended Restricted Isometry Property (RIP)-based analysis framework for two-layer ReLU neural networks.
arXiv Detail & Related papers (2023-05-07T13:05:09Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Bounding the Width of Neural Networks via Coupled Initialization -- A Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping [11.24426822697648]
We show that trained neural networks are smooth with respect to their inputs when trained by Gradient Descent (GD).
In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result.
arXiv Detail & Related papers (2021-07-12T11:56:53Z)
- Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
- Sample Complexity and Overparameterization Bounds for Projection-Free Neural TD Learning [38.730333068555275]
Existing analysis of neural TD learning relies on either infinite-width analysis or constraining the network parameters in a (random) compact set.
We show that the projection-free TD learning equipped with a two-layer ReLU network of any width exceeding $\mathrm{poly}(\overline{\nu}, 1/\epsilon)$ converges to the true value function with error $\epsilon$ given $\mathrm{poly}(\overline{\nu}, 1/\epsilon)$ iterations or samples.
arXiv Detail & Related papers (2021-03-02T01:05:19Z)
- A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.