Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks
- URL: http://arxiv.org/abs/2209.09298v1
- Date: Mon, 19 Sep 2022 18:48:00 GMT
- Title: Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks
- Authors: Yunwen Lei, Rong Jin, Yiming Ying
- Abstract summary: We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and gradient descent (SGD) to train SNNs, for both of which we develop consistent excess bounds.
- Score: 59.142826407441106
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While significant theoretical progress has been achieved, unveiling the
generalization mystery of overparameterized neural networks still remains
largely elusive. In this paper, we study the generalization behavior of shallow
neural networks (SNNs) by leveraging the concept of algorithmic stability. We
consider gradient descent (GD) and stochastic gradient descent (SGD) to train
SNNs, for both of which we develop consistent excess risk bounds by balancing
the optimization and generalization via early-stopping. As compared to existing
analysis on GD, our new analysis requires a relaxed overparameterization
assumption and also applies to SGD. The key for the improvement is a better
estimation of the smallest eigenvalues of the Hessian matrices of the empirical
risks and the loss function along the trajectories of GD and SGD by providing a
refined estimation of their iterates.
Related papers
- Stability and Generalization of the Decentralized Stochastic Gradient
Descent Ascent Algorithm [80.94861441583275]
We investigate the complexity of the generalization bound of the decentralized gradient descent (D-SGDA) algorithm.
Our results analyze the impact of different top factors on the generalization of D-SGDA.
We also balance it with the generalization to obtain the optimal convex-concave setting.
arXiv Detail & Related papers (2023-10-31T11:27:01Z) - How many Neurons do we need? A refined Analysis for Shallow Networks
trained with Gradient Descent [0.0]
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel regime.
We derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression.
arXiv Detail & Related papers (2023-09-14T22:10:28Z) - Stabilizing RNN Gradients through Pre-training [3.335932527835653]
Theory of learning proposes to prevent the gradient from exponential growth with depth or time, to stabilize and improve training.
We extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution.
We propose a new approach to mitigate this issue, that consists on giving a weight of a half to the time and depth contributions to the gradient.
arXiv Detail & Related papers (2023-08-23T11:48:35Z) - Generalization Guarantees of Gradient Descent for Multi-Layer Neural
Networks [55.86300309474023]
We conduct a comprehensive stability and generalization analysis of gradient descent (GD) for multi-layer NNs.
We derive the excess risk rate of $O(1/sqrtn)$ for GD algorithms in both two-layer and three-layer NNs.
arXiv Detail & Related papers (2023-05-26T12:51:38Z) - Neural Characteristic Activation Analysis and Geometric Parameterization for ReLU Networks [2.2713084727838115]
We introduce a novel approach for analyzing the training dynamics of ReLU networks by examining the characteristic activation boundaries of individual neurons.
Our proposed analysis reveals a critical instability in common neural network parameterizations and normalizations during convergence optimization, which impedes fast convergence and hurts performance.
arXiv Detail & Related papers (2023-05-25T10:19:13Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ implicit gradient descent (ISGD) method to train PINNs for improving the stability of training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Convergence proof for stochastic gradient descent in the training of
deep neural networks with ReLU activation for constant target functions [1.7149364927872015]
gradient descent (SGD) type optimization methods perform very effectively in the training of deep neural networks (DNNs)
In this work we study SGD type optimization methods in the training of fully-connected feedforward DNNs with rectified linear unit (ReLU) activation.
arXiv Detail & Related papers (2021-12-13T11:45:36Z) - Stability & Generalisation of Gradient Descent for Shallow Neural
Networks without the Neural Tangent Kernel [19.4934492061353]
We prove new generalisation and excess risk bounds without the Neural Tangent Kernel (NTK) or Polyak-Lojasiewicz (PL) assumptions.
We show oracle type bounds which reveal that the generalisation and excess risk of GD is controlled by an interpolating network with the shortest GD path from initialisation.
Unlike most of the NTK-based analyses we focus on regression with label noise and show that GD with early stopping is consistent.
arXiv Detail & Related papers (2021-07-27T10:53:15Z) - Fractal Structure and Generalization Properties of Stochastic
Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded on the complexity' of the fractal structure that underlies its generalization measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one hidden/layered neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.