Stability & Generalisation of Gradient Descent for Shallow Neural
Networks without the Neural Tangent Kernel
- URL: http://arxiv.org/abs/2107.12723v1
- Date: Tue, 27 Jul 2021 10:53:15 GMT
- Title: Stability & Generalisation of Gradient Descent for Shallow Neural
Networks without the Neural Tangent Kernel
- Authors: Dominic Richards, Ilja Kuzborskij
- Abstract summary: We prove new generalisation and excess risk bounds without the Neural Tangent Kernel (NTK) or Polyak-Lojasiewicz (PL) assumptions.
We show oracle-type bounds which reveal that the generalisation and excess risk of GD are controlled by an interpolating network with the shortest GD path from initialisation.
Unlike most of the NTK-based analyses we focus on regression with label noise and show that GD with early stopping is consistent.
- Score: 19.4934492061353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit on-average algorithmic stability of Gradient Descent (GD) for
training overparameterised shallow neural networks and prove new generalisation
and excess risk bounds without the Neural Tangent Kernel (NTK) or
Polyak-Łojasiewicz (PL) assumptions. In particular, we show oracle-type
bounds which reveal that the generalisation and excess risk of GD are controlled
by an interpolating network with the shortest GD path from initialisation (in a
sense, an interpolating network with the smallest relative norm). While this
was known for kernelised interpolants, our proof applies directly to networks
trained by GD without intermediate kernelisation. At the same time, by relaxing
oracle inequalities developed here we recover existing NTK-based risk bounds in
a straightforward way, which demonstrates that our analysis is tighter.
Finally, unlike most of the NTK-based analyses we focus on regression with
label noise and show that GD with early stopping is consistent.
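As a rough illustration of the setting described in the abstract (not the paper's algorithm or its bounds), the sketch below trains an overparameterised one-hidden-layer ReLU network by full-batch GD on noisy regression data, selects a stopping time on a held-out split, and tracks the distance travelled from initialisation, which is loosely related to the "GD path from initialisation" appearing in the oracle-type bounds. The width, step size, data model, and validation-based stopping rule are all illustrative assumptions; only the hidden layer is trained, a common simplification that may differ from the paper's exact setting.

```python
# Minimal sketch (not the paper's method): full-batch GD on a one-hidden-layer
# ReLU network for regression with label noise, with early stopping chosen on a
# held-out split. All hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data with additive label noise.
n, d, m = 200, 5, 1000                            # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)    # noisy targets
X_tr, y_tr = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]                   # held-out split for early stopping

# Network f(x) = sum_j a_j * relu(w_j . x) with fixed a_j = +/- 1/sqrt(m);
# only the hidden weights W are trained.
W0 = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def predict(W, X):
    return np.maximum(X @ W.T, 0.0) @ a

def grad(W, X, y):
    H = X @ W.T                                   # pre-activations, shape (n, m)
    R = predict(W, X) - y                         # residuals
    # gradient of 0.5 * mean squared error w.r.t. W, shape (m, d)
    return ((R[:, None] * (H > 0)) * a).T @ X / len(y)

W, eta, T = W0.copy(), 0.5, 2000
best_val, best_t, path_len = np.inf, 0, 0.0
for t in range(T):
    G = grad(W, X_tr, y_tr)
    W -= eta * G
    path_len += eta * np.linalg.norm(G)           # length of the GD path so far
    val = np.mean((predict(W, X_val) - y_val) ** 2)
    if val < best_val:
        best_val, best_t = val, t                 # iteration an early-stopping rule would pick
print(f"early-stopping iteration {best_t}, "
      f"distance from init {np.linalg.norm(W - W0):.3f}, GD path length {path_len:.3f}")
```

Running the sketch typically shows the validation error bottoming out well before the final iteration while the distance from initialisation keeps growing slowly, which is the qualitative picture behind early-stopping consistency under label noise.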
Related papers
- How many Neurons do we need? A refined Analysis for Shallow Networks
trained with Gradient Descent [0.0]
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel regime.
We derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression.
arXiv Detail & Related papers (2023-09-14T22:10:28Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK (an empirical-NTK computation is sketched after this list).
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and develop consistent excess risk bounds for both.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
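Several of the papers above work in the NTK regime. As a hedged, minimal sketch of what the NTK means operationally, the code below computes the empirical (finite-width) NTK Gram matrix of a one-hidden-layer ReLU network at initialisation, i.e. inner products of parameter gradients of the network output. Only hidden-layer gradients are used, and the width and scaling are illustrative assumptions, not taken from any of the listed papers.

```python
# Minimal sketch: empirical NTK Gram matrix Theta[i, k] = <grad_W f(x_i), grad_W f(x_k)>
# for a one-hidden-layer ReLU network at initialisation. Width/scaling are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 2000
X = rng.normal(size=(n, d))

W = rng.normal(size=(m, d))                  # hidden weights at initialisation
a = rng.choice([-1.0, 1.0], size=m)          # fixed output signs

def param_grads(X):
    """Gradient of f(x) = (1/sqrt(m)) * sum_j a_j relu(w_j . x) w.r.t. W, flattened per input."""
    H = X @ W.T                              # (n, m) pre-activations
    # df/dW[j] = (1/sqrt(m)) * a_j * 1[w_j . x > 0] * x
    J = (a * (H > 0))[:, :, None] * X[:, None, :] / np.sqrt(m)   # (n, m, d)
    return J.reshape(len(X), -1)

J = param_grads(X)
Theta = J @ J.T                              # empirical NTK Gram matrix, shape (n, n)
print(Theta.shape, float(Theta[0, 0]))
```

At large width this Gram matrix concentrates around its infinite-width limit, which is the kernel the NTK-based analyses above reason with; the main paper in this entry instead bounds generalisation for the finite-width network trained by GD directly.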
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.