Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime
- URL: http://arxiv.org/abs/2006.12297v2
- Date: Fri, 11 Jun 2021 14:51:53 GMT
- Title: Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime
- Authors: Atsushi Nitanda, Taiji Suzuki
- Abstract summary: We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
- Score: 50.510421854168065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the convergence of averaged stochastic gradient descent
for overparameterized two-layer neural networks on regression problems. It was
recently found that the neural tangent kernel (NTK) plays an important role in
establishing the global convergence of gradient-based methods under the NTK
regime, where the learning dynamics of overparameterized neural networks are
essentially characterized by those of the associated reproducing kernel
Hilbert space (RKHS). However, a sharp convergence rate analysis in the NTK
regime has been lacking. In this study, we show that averaged stochastic
gradient descent achieves the minimax optimal convergence rate, with a global
convergence guarantee, by exploiting the complexities of the target function
and of the RKHS associated with the NTK. Moreover, we show that a target
function specified by the NTK of a ReLU network can be learned at the optimal
convergence rate through a smooth approximation of the ReLU network under
certain conditions.
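For intuition, the sketch below runs averaged SGD on an overparameterized two-layer ReLU network with the usual 1/sqrt(m) output scaling, which is the kind of model the abstract refers to. The data, width, step size, and the choice to train only the input layer (with fixed random output weights) are illustrative assumptions, not the paper's exact setting or proof conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative, not the paper's setting).
n, d, m = 256, 5, 2048                     # samples, input dim, hidden width (m >> n)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.sin(X @ rng.standard_normal(d)) + 0.1 * rng.standard_normal(n)

# Two-layer ReLU network with NTK-style 1/sqrt(m) output scaling.
W = rng.standard_normal((m, d))            # trained input weights
a = rng.choice([-1.0, 1.0], size=m)        # fixed random output weights (common NTK-regime simplification)

def forward(W, x):
    # f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)
    return (a * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m)

eta, T = 0.5, 20000                        # constant step size and number of steps (illustrative choices)
W_bar = W.copy()                           # running (Polyak) average of the iterates

for t in range(1, T + 1):
    i = rng.integers(n)                    # one example per step (stochastic gradient)
    x, target = X[i], y[i]
    pre = W @ x
    err = forward(W, x) - target
    # Gradient of 0.5 * (f(x) - y)^2 with respect to W under the squared loss.
    grad = (err / np.sqrt(m)) * (a * (pre > 0.0))[:, None] * x[None, :]
    W -= eta * grad
    W_bar += (W - W_bar) / (t + 1)         # averaged SGD: mean of W_0, ..., W_t

train_mse = np.mean([(forward(W_bar, X[i]) - y[i]) ** 2 for i in range(n)])
print(f"training MSE of the averaged iterate: {train_mse:.4f}")
```

The object of interest is the averaged iterate W_bar rather than the last iterate; the paper's minimax-optimal rate is stated for this averaged estimator.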
Related papers
- Stochastic Gradient Descent for Two-layer Neural Networks [2.0349026069285423]
This paper studies the convergence rates of the stochastic gradient descent (SGD) algorithm when applied to overparameterized two-layer neural networks.
Our approach combines the Neural Tangent Kernel (NTK) approximation with convergence analysis in the Reproducing Kernel Hilbert Space (RKHS) generated by the NTK.
Our research framework enables us to explore the intricate interplay between kernel methods and optimization processes, shedding light on the dynamics and convergence properties of neural networks.
arXiv Detail & Related papers (2024-07-10T13:58:57Z) - How many Neurons do we need? A refined Analysis for Shallow Networks
trained with Gradient Descent [0.0]
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel regime.
We derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression.
arXiv Detail & Related papers (2023-09-14T22:10:28Z) - A Neural Network-Based Enrichment of Reproducing Kernel Approximation
for Modeling Brittle Fracture [0.0]
An improved version of the neural network-enhanced Reproducing Kernel Particle Method (NN-RKPM) is proposed for modeling brittle fracture.
The effectiveness of the proposed method is demonstrated by a series of numerical examples involving damage propagation and branching.
arXiv Detail & Related papers (2023-07-04T21:52:09Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and for both we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF)
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Improved Overparametrization Bounds for Global Convergence of Stochastic
Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of the stochastic gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z) - Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK (convolutional NTK) features matches the accuracy of the exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
arXiv Detail & Related papers (2021-06-15T04:44:52Z) - Weighted Neural Tangent Kernel: A Generalized and Improved
Network-Induced Kernel [20.84988773171639]
The Neural Tangent Kernel (NTK) has recently attracted intense study, as it describes the evolution of an over-parameterized Neural Network (NN) trained by gradient descent.
We introduce the Weighted Neural Tangent Kernel (WNTK), a generalized and improved tool, which can capture an over-parameterized NN's training dynamics under different gradients.
With the proposed weight update algorithm, both empirical and analytical WNTKs outperform the corresponding NTKs in numerical experiments (a minimal empirical-NTK sketch appears after this list).
arXiv Detail & Related papers (2021-03-22T03:16:20Z)
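As a companion to the Weighted Neural Tangent Kernel entry above, the following sketch computes the empirical NTK Gram matrix of a small two-layer ReLU network from parameter gradients, together with one illustrative per-parameter weighting. The weighting scheme shown here is an assumption made for illustration and is not necessarily the WNTK definition used in that paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small two-layer ReLU network, f(x) = (1/sqrt(m)) * a . relu(W x), with both W and a as parameters.
d, m, n = 5, 512, 8
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
X = rng.standard_normal((n, d)) / np.sqrt(d)

def param_gradient(x):
    """Flattened gradient of f(x) with respect to (W, a)."""
    pre = W @ x
    act = np.maximum(pre, 0.0)
    grad_W = (a * (pre > 0.0))[:, None] * x[None, :] / np.sqrt(m)   # df/dW, shape (m, d)
    grad_a = act / np.sqrt(m)                                       # df/da, shape (m,)
    return np.concatenate([grad_W.ravel(), grad_a])

J = np.stack([param_gradient(x) for x in X])   # n x P Jacobian of outputs w.r.t. parameters

# Empirical NTK Gram matrix: K[i, j] = <grad f(x_i), grad f(x_j)>.
K_ntk = J @ J.T

# One illustrative "weighted" variant: a positive weight per parameter inside the inner product,
# K_w[i, j] = sum_p w_p * J[i, p] * J[j, p].  (An assumed flavor of weighting, not WNTK's exact definition.)
w = rng.uniform(0.5, 1.5, size=J.shape[1])
K_wntk = (J * w) @ J.T

print("empirical NTK Gram matrix shape:", K_ntk.shape)
print("weighted variant is symmetric:", np.allclose(K_wntk, K_wntk.T))
```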