Learning Lipschitz Functions by GD-trained Shallow Overparameterized
ReLU Neural Networks
- URL: http://arxiv.org/abs/2212.13848v2
- Date: Thu, 6 Apr 2023 15:01:30 GMT
- Title: Learning Lipschitz Functions by GD-trained Shallow Overparameterized
ReLU Neural Networks
- Authors: Ilja Kuzborskij, Csaba Szepesvári
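- Abstract summary: Because neural networks trained to nearly zero training error are inconsistent on the class of Lipschitz, nondifferentiable, bounded functions in the presence of noise, we focus on early-stopped GD, which allows us to show consistency and optimal rates.
We show that whenever some early stopping rule is guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the kernel induced by the ReLU activation function, the same rule can be used to achieve the minimax optimal rate for learning the considered class of Lipschitz functions by neural networks.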
- Score: 12.018422134251384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the ability of overparameterized shallow ReLU neural networks to
learn Lipschitz, nondifferentiable, bounded functions with additive noise when
trained by Gradient Descent (GD). To avoid the problem that in the presence of
noise, neural networks trained to nearly zero training error are inconsistent
in this class, we focus on early-stopped GD, which allows us to show
consistency and optimal rates. In particular, we explore this problem from the
viewpoint of the Neural Tangent Kernel (NTK) approximation of a GD-trained
finite-width neural network. We show that whenever some early stopping rule is
guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the
kernel induced by the ReLU activation function, the same rule can be used to
achieve the minimax optimal rate for learning on the class of considered Lipschitz
functions by neural networks. We discuss several data-free and data-dependent
practically appealing stopping rules that yield optimal rates.
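The setting described in the abstract can be illustrated with a minimal NumPy sketch: a shallow, wide ReLU network whose hidden layer is trained by full-batch GD on noisy samples of a Lipschitz, nondifferentiable, bounded target, with the iterate chosen by a hold-out rule. Everything specific here is an assumption made for illustration and is not taken from the paper: the 1/sqrt(m) output scaling with fixed random signs, the target |x_1| on the unit circle, the noise level, and the hold-out rule itself, which merely stands in for a generic data-dependent stopping rule. All function names are hypothetical.

```python
import numpy as np

def make_data(n, rng, noise=0.3):
    # Inputs on the unit circle; target |x_1| is Lipschitz, bounded, nondifferentiable.
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    y = np.abs(X[:, 0]) + noise * rng.standard_normal(n)
    return X, y

def init_network(width, dim, rng):
    # Gaussian hidden weights; output signs are fixed at +-1 and never trained (assumption).
    W = rng.standard_normal((width, dim))
    a = rng.choice([-1.0, 1.0], size=width)
    return W, a

def predict(W, a, X):
    # f(x) = (1 / sqrt(m)) * sum_k a_k * relu(<w_k, x>)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def mse(W, a, X, y):
    return float(np.mean((predict(W, a, X) - y) ** 2))

def gd_step(W, a, X, y, lr):
    # One full-batch GD step on the loss (1 / (2n)) * sum_i (f(x_i) - y_i)^2.
    m = W.shape[0]
    pre = X @ W.T                                # (n, m) pre-activations
    resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    active = (pre > 0).astype(float)             # ReLU derivative
    grad = (resid[:, None] * active * a[None, :]).T @ X / (len(y) * np.sqrt(m))
    return W - lr * grad

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              width=512, lr=1.0, steps=300, seed=0):
    # Run GD for `steps` iterations and return the iterate with the
    # smallest hold-out error (illustrative stopping rule).
    rng = np.random.default_rng(seed)
    W, a = init_network(width, X_tr.shape[1], rng)
    best_W, best_val, best_step = W.copy(), mse(W, a, X_val, y_val), 0
    for t in range(1, steps + 1):
        W = gd_step(W, a, X_tr, y_tr, lr)
        val = mse(W, a, X_val, y_val)
        if val < best_val:
            best_W, best_val, best_step = W.copy(), val, t
    return best_W, a, best_step, best_val

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_tr, y_tr = make_data(100, rng)
    X_val, y_val = make_data(100, rng)
    W, a, step, val = train_with_early_stopping(X_tr, y_tr, X_val, y_val)
    print("selected step:", step, "hold-out MSE:", val)
```

Comparing the hold-out error at the selected step with the error after many more GD steps gives a rough feel for why stopping early matters under noise. The sketch makes no claim about rates.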
Related papers
- Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum [18.10812063219831]
We introduce Modified Spectrum Kernels (MSKs) to approximate kernels with desired eigenvalues.
We propose a preconditioned gradient descent method, which alters the trajectory of gradient descent.
Our method is both computationally efficient and simple to implement.
arXiv Detail & Related papers (2023-07-26T22:39:47Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Training Certifiably Robust Neural Networks with Efficient Local Lipschitz Bounds [99.23098204458336]
Certified robustness is a desirable property for deep neural networks in safety-critical applications.
We show that our method consistently outperforms state-of-the-art methods on MNIST and TinyImageNet datasets.
arXiv Detail & Related papers (2021-11-02T06:44:10Z)
- Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping [11.24426822697648]
We show that neural networks trained by Gradient Descent (GD) are smooth with respect to their inputs.
In the noise-free case the proof does not rely on any kernelization and can be regarded as a finite-width result.
arXiv Detail & Related papers (2021-07-12T11:56:53Z)
- Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)