Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural
Isometry and Exact Recovery
- URL: http://arxiv.org/abs/2209.15265v2
- Date: Tue, 4 Oct 2022 05:00:43 GMT
- Title: Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural
Isometry and Exact Recovery
- Authors: Yifei Wang, Yixuan Hua, Emmanuel Candès, Mert Pilanci
- Abstract summary: Deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters.
We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization.
We show that ReLU networks learn simple and sparse models even when the labels are noisy.
- Score: 33.74925020397343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The practice of deep learning has shown that neural networks generalize
remarkably well even with an extreme number of learned parameters. This appears
to contradict traditional statistical wisdom, in which a trade-off between
model complexity and fit to the data is essential. We set out to resolve this
discrepancy from a convex optimization and sparse recovery perspective. We
consider the training and generalization properties of two-layer ReLU networks
with standard weight decay regularization. Under certain regularity assumptions
on the data, we show that ReLU networks with an arbitrary number of parameters
learn only simple models that explain the data. This is analogous to the
recovery of the sparsest linear model in compressed sensing. For ReLU networks
and their variants with skip connections or normalization layers, we present
isometry conditions that ensure the exact recovery of planted neurons. For
randomly generated data, we show the existence of a phase transition in
recovering planted neural network models. The situation is simple: whenever the
ratio between the number of samples and the dimension exceeds a numerical
threshold, the recovery succeeds with high probability; otherwise, it fails
with high probability. Surprisingly, ReLU networks learn simple and sparse
models even when the labels are noisy. The phase transition phenomenon is
confirmed through numerical experiments.
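As a rough illustration of the claimed phase transition, one can train an overparameterized two-layer ReLU network with weight decay on data generated by a single planted ReLU neuron and track how well the planted direction is recovered as the sample-to-dimension ratio n/d grows. The sketch below is minimal and illustrative; the network width, optimizer, and alignment metric are assumptions, not the authors' experimental protocol.

```python
import torch

def recovery_experiment(n, d, m=100, wd=1e-3, steps=3000, seed=0):
    """Fit y = ReLU(X w*) with an overparameterized two-layer ReLU net
    trained under weight decay, and report how well the most important
    hidden neuron aligns with the planted direction w*."""
    torch.manual_seed(seed)
    X = torch.randn(n, d)
    w_star = torch.randn(d)
    w_star /= w_star.norm()
    y = torch.relu(X @ w_star)                 # noiseless planted-neuron labels

    W1 = torch.nn.Parameter(0.1 * torch.randn(d, m))
    w2 = torch.nn.Parameter(0.1 * torch.randn(m))
    opt = torch.optim.Adam([W1, w2], lr=1e-2, weight_decay=wd)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((torch.relu(X @ W1) @ w2 - y) ** 2).mean()
        loss.backward()
        opt.step()

    with torch.no_grad():
        importance = W1.norm(dim=0) * w2.abs()  # per-neuron contribution
        j = importance.argmax()
        return torch.nn.functional.cosine_similarity(
            W1[:, j], w_star, dim=0).abs().item()

d = 20
for ratio in (1, 2, 4, 8):                      # sweep n/d past the threshold
    print(f"n/d = {ratio}: alignment = {recovery_experiment(ratio * d, d):.3f}")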
Related papers
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised
Time Series Anomaly Detection [49.52429991848581]
We propose a Light and Anti-overfitting Retraining Approach (LARA) for deep variational auto-encoder (VAE) based time series anomaly detection methods.
This work makes three novel contributions: 1) the retraining process is formulated as a convex problem, so it converges quickly and resists overfitting; 2) a ruminate block is designed to leverage historical data without the need to store it; and 3) it is proven mathematically that, when fine-tuning the latent vectors and reconstructed data, linear formations achieve the least adjusting errors between the ground truths and the fine-tuned values.
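Contribution 3 reduces the fine-tuning step to ordinary least squares, which is convex and has a closed-form solution. A hypothetical toy sketch of that idea follows; the shapes, names, and placeholder data are illustrative, not LARA's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((500, 16))   # latent vectors from a frozen VAE (placeholder data)
y = rng.standard_normal((500, 32))   # ground-truth targets to adjust toward

# A linear adjustment y ~= z @ A minimizes the adjusting error in closed
# form: convex, fast to solve, and with no extra capacity to overfit.
A, *_ = np.linalg.lstsq(z, y, rcond=None)
err = np.linalg.norm(z @ A - y) / np.linalg.norm(y)
print(f"relative adjusting error: {err:.3f}")
```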
arXiv Detail & Related papers (2023-10-09T12:36:16Z) - A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree
Spectral Bias of Neural Networks [79.28094304325116]
Despite the capacity of neural networks to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards "simpler" functions.
We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets.
We propose a new scalable functional regularization scheme that aids the neural network to learn higher degree frequencies.
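To make "low-degree spectral bias" concrete: on the Boolean cube {-1, +1}^d, any function decomposes into Walsh-Hadamard characters indexed by subsets S, with degree |S|. The brute-force sketch below is an added illustration, not the paper's method; the paper's contribution is precisely a scalable regularizer that avoids this exhaustive enumeration.

```python
import itertools
import torch

d = 6
# All 2^d points of the Boolean cube {-1, +1}^d (tractable only for tiny d).
cube = torch.tensor(list(itertools.product([-1.0, 1.0], repeat=d)))

net = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, 1))

with torch.no_grad():
    f = net(cube).squeeze(1)
    mass = {}
    for S in itertools.product([False, True], repeat=d):
        chi = cube[:, torch.tensor(S)].prod(dim=1)  # character chi_S(x)
        coef = (f * chi).mean().item()              # Fourier-Walsh coefficient
        mass[sum(S)] = mass.get(sum(S), 0.0) + coef ** 2

print(mass)  # spectral mass per degree; ReLU nets typically skew low-degree
# A regularizer in this spirit penalizes the low-degree share of this mass.
```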
arXiv Detail & Related papers (2023-05-16T20:06:01Z) - Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z) - More is Less: Inducing Sparsity via Overparameterization [2.885175627590247]
In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples.
Quite surprisingly, training the overparameterized network via (stochastic) gradient descent leads to solutions that generalize very well.
Our proof relies on analyzing a certain Bregman divergence of the flow.
arXiv Detail & Related papers (2021-12-21T07:55:55Z) - Robust Generalization of Quadratic Neural Networks via Function
Identification [19.87036824512198]
Generalization bounds from learning theory often assume that the test distribution is close to the training distribution.
We show that for quadratic neural networks, we can identify the function represented by the model even though we cannot identify its parameters.
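The distinction between identifying the function and identifying the parameters is easy to see concretely: a quadratic network f(x) = Σᵢ aᵢ(wᵢᵀx)² is fully determined by the symmetric matrix M = Σᵢ aᵢwᵢwᵢᵀ, so different parameter sets can realize the same function. A quick numerical check of this fact (an added illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 8

# First parameterization of a quadratic network f(x) = sum_i a_i (w_i^T x)^2.
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
M = (W.T * a) @ W                 # induced matrix M = sum_i a_i w_i w_i^T

# Second, different parameterization with the same induced matrix:
# eigendecompose M and use the eigenvectors as neurons.
vals, vecs = np.linalg.eigh(M)
W2, a2 = vecs.T, vals             # d neurons instead of m

x = rng.standard_normal(d)
f1 = float(a @ (W @ x) ** 2)
f2 = float(a2 @ (W2 @ x) ** 2)
assert np.isclose(f1, f2)         # identical function, different parameters
print(f1, f2)
```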
arXiv Detail & Related papers (2021-09-22T18:02:00Z) - Slope and generalization properties of neural networks [0.0]
We show that the distribution of the slope of a well-trained neural network classifier is generally independent of the width of the layers in a fully connected network.
The slope is of similar size throughout the relevant volume, and varies smoothly. It also behaves as predicted in rescaling examples.
We discuss possible applications of the slope concept, such as using it as a part of the loss function or stopping criterion during network training, or ranking data sets in terms of their complexity.
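One common way to instantiate such a slope measure is the norm of the network's input gradient at each data point; whether this matches the paper's exact definition is an assumption here. A minimal sketch:

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))
x = torch.randn(256, 10, requires_grad=True)
# Summing outputs yields per-sample input gradients in one backward pass.
(g,) = torch.autograd.grad(net(x).sum(), x)
slopes = g.norm(dim=1)            # one slope value per input point
print(slopes.mean().item(), slopes.std().item())
```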
arXiv Detail & Related papers (2021-07-03T17:54:27Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - System Identification Through Lipschitz Regularized Deep Neural Networks [0.4297070083645048]
We use neural networks to learn governing equations from data.
We reconstruct the right-hand side of a system of ODEs $\dot{x}(t) = f(t, x(t))$ directly from observed uniformly time-sampled data.
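A minimal sketch of this setup on a 2-D harmonic oscillator, estimating the derivative by central differences and fitting f with a small network. Here plain weight decay stands in for the paper's Lipschitz regularization, which is an assumption for illustration.

```python
import torch

# Observed trajectory of dx/dt = f(t, x), uniformly sampled in time.
dt, T = 0.01, 1000
t = torch.arange(T) * dt
x = torch.stack([torch.cos(t), -torch.sin(t)], dim=1)  # oscillator solution
xdot = (x[2:] - x[:-2]) / (2 * dt)                     # central differences
inputs = torch.cat([t[1:-1, None], x[1:-1]], dim=1)    # (t, x(t)) pairs

# Small MLP approximating the right-hand side f(t, x).
net = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 2),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(inputs) - xdot) ** 2).mean()
    loss.backward()
    opt.step()
print("fit error:", loss.item())
```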
arXiv Detail & Related papers (2020-09-07T17:52:51Z) - Neural Networks and Polynomial Regression. Demystifying the
Overparametrization Phenomena [17.205106391379026]
In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data.
A conventional explanation of this phenomenon is based on the self-regularization properties of the algorithms used to train these models.
We show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by data dimension.
arXiv Detail & Related papers (2020-03-23T20:09:31Z)