One-pass Stochastic Gradient Descent in Overparametrized Two-layer
Neural Networks
- URL: http://arxiv.org/abs/2105.00262v1
- Date: Sat, 1 May 2021 14:34:03 GMT
- Title: One-pass Stochastic Gradient Descent in Overparametrized Two-layer
Neural Networks
- Authors: Jiaming Xu and Hanjing Zhu
- Abstract summary: We study the streaming data setup and show that the prediction error of two-layer neural networks under one-pass SGD converges in expectation.
The convergence rate depends on the eigen-decomposition of the integral operator associated with the so-called neural tangent kernel (NTK).
A key step of our analysis is to show a random kernel function converges to the NTK with high probability using the VC dimension and McDiarmid's inequality.
- Score: 15.789476296152559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a recent surge of interest in understanding the convergence of
gradient descent (GD) and stochastic gradient descent (SGD) in
overparameterized neural networks. Most previous works assume that the training
data is provided a priori in a batch, while less attention has been paid to the
important setting where the training data arrives in a stream. In this paper,
we study the streaming data setup and show that with overparameterization and
random initialization, the prediction error of two-layer neural networks under
one-pass SGD converges in expectation. The convergence rate depends on the
eigen-decomposition of the integral operator associated with the so-called
neural tangent kernel (NTK). A key step of our analysis is to show a random
kernel function converges to the NTK with high probability using the VC
dimension and McDiarmid's inequality.
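  As a rough illustration of the streaming setup described in the abstract, the sketch below runs one-pass SGD on a randomly initialized, overparameterized two-layer ReLU network: each sample is drawn fresh, used for a single gradient step, and never revisited, while the prediction error is monitored on a held-out set. This is not the paper's code; the width, step size, input distribution, and target function are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions only) of one-pass SGD on an
# overparameterized two-layer ReLU network trained on streaming data.
import numpy as np

rng = np.random.default_rng(0)

d, m = 5, 2048          # input dimension, hidden width (overparameterized)
T, lr = 20_000, 0.5     # number of streaming samples, SGD step size

# Random initialization in the NTK parameterization:
# f(x) = (1/sqrt(m)) * sum_r a_r * relu(<w_r, x>), with the signs a_r fixed.
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def target(x):
    """Hypothetical smooth target on the unit sphere (illustrative only)."""
    return np.sin(2.0 * x[0]) + 0.5 * x[1] * x[2]

def sample_x():
    """Draw x uniformly from the unit sphere in R^d."""
    x = rng.standard_normal(d)
    return x / np.linalg.norm(x)

def predict(W, x):
    pre = W @ x                        # pre-activations, shape (m,)
    return (a * np.maximum(pre, 0.0)).sum() / np.sqrt(m)

# Held-out points used to estimate the prediction (population) error.
X_test = np.stack([sample_x() for _ in range(500)])
y_test = np.array([target(x) for x in X_test])

def test_error(W):
    preds = np.array([predict(W, x) for x in X_test])
    return np.mean((preds - y_test) ** 2)

for t in range(1, T + 1):
    x = sample_x()                     # each sample is seen exactly once
    y = target(x)
    pre = W @ x
    resid = predict(W, x) - y
    # Gradient of 0.5 * (f(x) - y)^2 with respect to the first-layer weights.
    grad = resid * (a * (pre > 0.0))[:, None] * x[None, :] / np.sqrt(m)
    W -= lr * grad
    if t % 5000 == 0:
        print(f"step {t:6d}  estimated prediction error {test_error(W):.5f}")
```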
Related papers
- Observation Noise and Initialization in Wide Neural Networks [9.163214210191814]
We introduce a shifted network that enables arbitrary prior mean functions.
Our theoretical insights are validated empirically, with experiments exploring different values of observation noise and network architectures.
arXiv Detail & Related papers (2025-02-03T17:39:45Z)
- Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [19.988762532185884]
We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of nonparametric regression risk.
$\mathcal{O}(\epsilon_n^2)$ is the same rate as that for the classical kernel regression trained by GD with early stopping.
arXiv Detail & Related papers (2024-11-05T08:43:54Z)
- Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [8.130817534654089]
We consider nonparametric regression by a two-layer neural network trained by gradient descent (GD) or its variant in this paper.
We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has the spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$.
arXiv Detail & Related papers (2024-07-16T03:38:34Z)
- How many Neurons do we need? A refined Analysis for Shallow Networks trained with Gradient Descent [0.0]
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel regime.
We derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression.
arXiv Detail & Related papers (2023-09-14T22:10:28Z)
- Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
- A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.