Generalization Guarantees of Gradient Descent for Multi-Layer Neural
Networks
- URL: http://arxiv.org/abs/2305.16891v2
- Date: Fri, 29 Sep 2023 07:17:50 GMT
- Title: Generalization Guarantees of Gradient Descent for Multi-Layer Neural
Networks
- Authors: Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, Ding-Xuan Zhou
- Abstract summary: We conduct a comprehensive stability and generalization analysis of gradient descent (GD) for multi-layer NNs.
We derive the excess risk rate of $O(1/\sqrt{n})$ for GD algorithms in both two-layer and three-layer NNs.
- Score: 55.86300309474023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, significant progress has been made in understanding the
generalization of neural networks (NNs) trained by gradient descent (GD) using
the algorithmic stability approach. However, most of the existing research has
focused on one-hidden-layer NNs and has not addressed the impact of different
network scaling parameters. In this paper, we greatly extend the previous work
\cite{lei2022stability,richards2021stability} by conducting a comprehensive
stability and generalization analysis of GD for multi-layer NNs. For two-layer
NNs, our results are established under general network scaling parameters,
relaxing previous conditions. In the case of three-layer NNs, our technical
contribution lies in demonstrating its nearly co-coercive property by utilizing
a novel induction strategy that thoroughly explores the effects of
over-parameterization. As a direct application of our general findings, we
derive the excess risk rate of $O(1/\sqrt{n})$ for GD algorithms in both
two-layer and three-layer NNs. This sheds light on sufficient or necessary
conditions for under-parameterized and over-parameterized NNs trained by GD to
attain the desired risk rate of $O(1/\sqrt{n})$. Moreover, we demonstrate that
as the scaling parameter increases or the network complexity decreases, less
over-parameterization is required for GD to achieve the desired error rates.
Additionally, under a low-noise condition, we obtain a fast risk rate of
$O(1/n)$ for GD in both two-layer and three-layer NNs.
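To make the analyzed setting concrete, here is a minimal sketch of full-batch gradient descent on a two-layer (one-hidden-layer) ReLU network whose output is rescaled by a width-dependent factor. The width m, scaling exponent c, step size, and synthetic data below are illustrative assumptions, not the paper's exact parameterization.
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the paper).
n, d, m = 200, 10, 512           # samples, input dimension, hidden width
c = 0.5                          # scaling exponent: network output divided by m**c
eta, T = 0.1, 500                # step size and number of GD iterations

# Synthetic regression data.
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Two-layer ReLU network f(x) = (1 / m**c) * a . relu(W x), trained by full-batch GD
# on the first-layer weights W (output weights a are frozen for simplicity).
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

def predict(X, W, a):
    H = np.maximum(X @ W.T, 0.0)                 # hidden ReLU features, shape (n, m)
    return H, (H @ a) / m**c                     # predictions, shape (n,)

for t in range(T):
    H, pred = predict(X, W, a)
    resid = pred - y
    # Gradient of the empirical squared loss 1/(2n) * sum_i (f(x_i) - y_i)^2 w.r.t. W.
    grad_W = ((resid[:, None] * (H > 0) * a[None, :]).T @ X) / (n * m**c)
    W -= eta * grad_W

print("final training MSE:", np.mean((predict(X, W, a)[1] - y) ** 2))
```
In this notation, the abstract's message is that a larger scaling parameter (or lower network complexity) lets GD reach the $O(1/\sqrt{n})$ excess risk rate with less over-parameterization, i.e., a smaller width m.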
Related papers
- On the Convergence of (Stochastic) Gradient Descent for Kolmogorov--Arnold Networks [56.78271181959529]
Kolmogorov--Arnold Networks (KANs) have gained significant attention in the deep learning community.
Empirical investigations demonstrate that KANs optimized via stochastic gradient descent (SGD) are capable of achieving near-zero training loss.
arXiv Detail & Related papers (2024-10-10T15:34:10Z)
- Stochastic Gradient Descent for Two-layer Neural Networks [2.0349026069285423]
This paper presents a study on the convergence rates of the stochastic gradient descent (SGD) algorithm when applied to overparameterized two-layer neural networks.
Our approach combines the Neural Tangent Kernel (NTK) approximation with convergence analysis in the Reproducing Kernel Hilbert Space (RKHS) generated by the NTK.
Our research framework enables us to explore the intricate interplay between kernel methods and optimization processes, shedding light on the dynamics and convergence properties of neural networks.
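As a rough illustration of the NTK approximation referenced above, the sketch below forms the empirical NTK Gram matrix of a randomly initialized two-layer ReLU network (taking gradients with respect to the hidden-layer weights only) and uses it for kernel ridge regression. The sizes and data are assumed for illustration and are not taken from the paper.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 1000                        # samples, input dim, hidden width (assumed)

X = rng.normal(size=(n, d))
W = rng.normal(size=(m, d))                  # hidden-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

# Per-example gradients of f(x) = a . relu(W x) w.r.t. W, flattened:
# df/dw_j = a_j * 1[w_j . x > 0] * x
act = (X @ W.T > 0).astype(float)                        # (n, m) activation pattern
grads = (act * a[None, :])[:, :, None] * X[:, None, :]   # (n, m, d)
grads = grads.reshape(n, -1)

K = grads @ grads.T                          # empirical NTK Gram matrix, shape (n, n)
print("NTK Gram matrix shape:", K.shape)

# In the NTK regime, (S)GD training behaves like kernel regression with K,
# e.g. the kernel ridge predictor on training targets y:
y = np.sin(X[:, 0])
alpha = np.linalg.solve(K + 1e-3 * np.eye(n), y)   # small ridge term for stability
print("in-sample fit MSE:", np.mean((K @ alpha - y) ** 2))
```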
arXiv Detail & Related papers (2024-07-10T13:58:57Z)
- Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z)
- Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z)
- Generalization Error Bounds for Deep Neural Networks Trained by SGD [3.148524502470734]
Generalization error bounds for deep neural networks trained by stochastic gradient descent (SGD) are derived.
The bounds explicitly depend on the loss along the training trajectory.
Results show that our bounds are non-vacuous and robust to changes in the network architecture and network hyperparameters.
arXiv Detail & Related papers (2022-06-07T13:46:10Z)
- A Convergence Analysis of Nesterov's Accelerated Gradient Method in
Training Deep Linear Neural Networks [21.994004684742812]
Momentum methods are widely used in training neural networks for their fast convergence.
We show that NAG can converge to the global minimum at a rate governed by the condition number $\kappa$.
We extend our analysis to deep linear ResNets and derive a similar result.
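For reference, a minimal sketch of the Nesterov accelerated gradient (NAG) update applied to a small deep linear network trained with the squared loss. The depth, widths, step size, and momentum parameter are illustrative assumptions rather than the setting analyzed in the paper.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1))                  # linear teacher targets

# Deep linear network: f(X) = X W1 W2 W3 (no nonlinearities).
widths = [d, 16, 16, 1]
Ws = [rng.normal(size=(widths[i], widths[i + 1])) / np.sqrt(widths[i])
      for i in range(3)]
eta, beta = 0.01, 0.9                            # step size and momentum (assumed)

def loss_and_grads(params):
    Hs = [X]
    for W in params:                             # forward pass, caching activations
        Hs.append(Hs[-1] @ W)
    resid = Hs[-1] - Y
    loss = 0.5 * np.mean(np.sum(resid ** 2, axis=1))
    grads, G = [], resid / n                     # backprop through the linear layers
    for i in reversed(range(3)):
        grads.insert(0, Hs[i].T @ G)
        G = G @ params[i].T
    return loss, grads

# NAG: evaluate the gradient at a "lookahead" point, then extrapolate.
lookahead = [W.copy() for W in Ws]
for t in range(300):
    _, grads = loss_and_grads(lookahead)
    new_Ws = [V - eta * g for V, g in zip(lookahead, grads)]
    lookahead = [nW + beta * (nW - W) for nW, W in zip(new_Ws, Ws)]
    Ws = new_Ws

print("final training loss:", loss_and_grads(Ws)[0])
```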
arXiv Detail & Related papers (2022-04-18T13:24:12Z)
- Regularizing Recurrent Neural Networks via Sequence Mixup [7.036759195546171]
We extend a class of celebrated regularization techniques originally proposed for feed-forward neural networks to recurrent neural networks.
Our proposed methods are easy to implement and computationally inexpensive, while improving the performance of simple neural architectures.
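To make the regularization idea concrete, here is one natural input-level variant, sequence-level mixup, which convexly combines two input sequences and their labels. The mixing distribution and toy data are assumptions; the paper's actual schemes may differ (for example, mixing hidden states instead of inputs).
```python
import numpy as np

rng = np.random.default_rng(0)

def sequence_mixup(x_a, x_b, y_a, y_b, alpha=0.2):
    """Convexly combine two input sequences and their label vectors.

    Plain input-level mixup applied to whole sequences; treat this as a sketch,
    not the paper's exact method.
    """
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x_a + (1.0 - lam) * x_b        # (T, d) sequences of equal length
    y_mix = lam * y_a + (1.0 - lam) * y_b        # e.g. one-hot label vectors
    return x_mix, y_mix, lam

# Toy usage: two length-20 sequences of 8-dim features with 3-class one-hot labels.
T, d, k = 20, 8, 3
x_a, x_b = rng.normal(size=(T, d)), rng.normal(size=(T, d))
y_a, y_b = np.eye(k)[0], np.eye(k)[2]
x_mix, y_mix, lam = sequence_mixup(x_a, x_b, y_a, y_b)
print("lambda =", round(lam, 3), "mixed label =", y_mix)
```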
arXiv Detail & Related papers (2020-11-27T05:43:40Z)
- Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
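For intuition only, the familiar two-layer mean-field limit (not the paper's multi-layer construction) replaces the finite sum over neurons by an integral against a probability measure over parameters:
```latex
% Finite two-layer network with m hidden neurons
f_m(x) = \frac{1}{m} \sum_{j=1}^{m} a_j \, \sigma(w_j^\top x)
% Mean-field (continuous-width) limit: the neurons (a_j, w_j) become a measure \mu
f_\mu(x) = \int a \, \sigma(w^\top x) \, \mathrm{d}\mu(a, w)
```
The framework summarized above extends this viewpoint to deeper architectures by tracking probability measures and functions over the features of each layer.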
arXiv Detail & Related papers (2020-07-03T01:37:16Z)
- Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
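As a reminder of what iterate averaging looks like in practice, here is a minimal sketch of SGD with tail (Polyak-Ruppert style) averaging on a linear least-squares problem. The problem, constant step size, and averaging window are illustrative assumptions and not the NTK setting of the paper.
```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)        # noisy linear targets

eta, T = 0.01, 10000                             # constant step size, SGD steps
w = np.zeros(d)
w_bar, count = np.zeros(d), 0                    # tail average of the iterates

for t in range(1, T + 1):
    i = rng.integers(n)                          # pick one example at random
    grad = (X[i] @ w - y[i]) * X[i]              # gradient of 0.5 * (x_i.w - y_i)^2
    w -= eta * grad
    if t > T // 2:                               # average only the second half
        count += 1
        w_bar += (w - w_bar) / count

print("last-iterate error    :", np.linalg.norm(w - w_star))
print("averaged-iterate error:", np.linalg.norm(w_bar - w_star))
```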
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.