Convergence of Gradient Descent for Recurrent Neural Networks: A
Nonasymptotic Analysis
- URL: http://arxiv.org/abs/2402.12241v1
- Date: Mon, 19 Feb 2024 15:56:43 GMT
- Title: Convergence of Gradient Descent for Recurrent Neural Networks: A
Nonasymptotic Analysis
- Authors: Semih Cayci, Atilla Eryilmaz
- Abstract summary: We analyze recurrent neural networks trained with gradient descent in the supervised learning setting for dynamical systems.
We show that an appropriately-initialized recurrent neural network trained with $n$ samples can achieve optimality with a network size $m$ that scales only logarithmically with $n$.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze recurrent neural networks trained with gradient descent in the
supervised learning setting for dynamical systems, and prove that gradient
descent can achieve optimality \emph{without} massive overparameterization. Our
in-depth nonasymptotic analysis (i) provides sharp bounds on the network size
$m$ and iteration complexity $\tau$ in terms of the sequence length $T$, sample
size $n$ and ambient dimension $d$, and (ii) identifies the significant impact
of long-term dependencies in the dynamical system on the convergence and
network width bounds characterized by a cutoff point that depends on the
Lipschitz continuity of the activation function. Remarkably, this analysis
reveals that an appropriately-initialized recurrent neural network trained with
$n$ samples can achieve optimality with a network size $m$ that scales only
logarithmically with $n$. This sharply contrasts with the prior works that
require high-order polynomial dependency of $m$ on $n$ to establish strong
regularity conditions. Our results are based on an explicit characterization of
the class of dynamical systems that can be approximated and learned by
recurrent neural networks via norm-constrained transportation mappings, and
establishing local smoothness properties of the hidden state with respect to
the learnable parameters.
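As a loose illustration of the setting only (not the paper's construction or bounds), the sketch below trains a scalar RNN with plain gradient descent, via backpropagation through time, on sequences generated by a simple dynamical system; the architecture, constants, and target system are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: a scalar RNN h_t = tanh(w*h_{t-1} + u*x_t) with
# prediction v*h_T, trained by gradient descent via backpropagation
# through time (BPTT). The target is a simple linear dynamical system
# (an exponentially weighted average of the inputs).
rng = np.random.default_rng(0)
T, n = 8, 64                                  # sequence length, sample size
X = rng.normal(size=(n, T))
Y = X @ (0.5 ** np.arange(T)[::-1])           # target dynamical system

w, u, v, lr = 0.1, 0.1, 0.1, 0.05             # parameters and step size
losses = []
for step in range(200):
    gw = gu = gv = loss = 0.0
    for x, y in zip(X, Y):
        h = np.zeros(T + 1)                   # hidden-state trajectory
        for t in range(T):
            h[t + 1] = np.tanh(w * h[t] + u * x[t])
        err = v * h[T] - y
        loss += 0.5 * err**2 / n
        gv += err * h[T] / n
        dh = err * v                          # gradient w.r.t. h_T
        for t in range(T - 1, -1, -1):        # backprop through time
            da = dh * (1.0 - h[t + 1] ** 2)   # through tanh
            gw += da * h[t] / n
            gu += da * x[t] / n
            dh = da * w                       # pass gradient to h_{t-1}
    losses.append(loss)
    w, u, v = w - lr * gw, u - lr * gu, v - lr * gv
```

Even this toy example exposes the role of long-term dependencies the abstract highlights: the backward pass repeatedly multiplies by $w$ and the activation derivative, so the effective Lipschitz constant of the recurrence governs how gradients propagate over the sequence length $T$.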
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Demystifying Lazy Training of Neural Networks from a Macroscopic Viewpoint [5.9954962391837885]
We study the gradient descent dynamics of neural networks through the lens of macroscopic limits.
Our study reveals that gradient descent can rapidly drive deep neural networks to zero training loss.
Our approach draws inspiration from the Neural Tangent Kernel (NTK) paradigm.
arXiv Detail & Related papers (2024-04-07T08:07:02Z) - On Excess Risk Convergence Rates of Neural Network Classifiers [8.329456268842227]
We study the performance of plug-in classifiers based on neural networks in a binary classification setting as measured by their excess risks.
We analyze the estimation and approximation properties of neural networks to obtain a dimension-free, uniform rate of convergence.
arXiv Detail & Related papers (2023-09-26T17:14:10Z) - Speed Limits for Deep Learning [67.69149326107103]
Recent advances in thermodynamics allow bounding the speed at which one can go from the initial weight distribution to the final distribution of the fully trained network.
We provide analytical expressions for these speed limits for linear and linearizable neural networks.
Remarkably, given some plausible scaling assumptions on the NTK spectra and the spectral decomposition of the labels, learning is optimal in a scaling sense.
arXiv Detail & Related papers (2023-07-27T06:59:46Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Excess risk bound for deep learning under weak dependence [0.0]
This paper considers deep neural networks for learning weakly dependent processes.
We derive the required depth, width and sparsity of a deep neural network to approximate any Hölder smooth function.
arXiv Detail & Related papers (2023-02-15T07:23:48Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Critical Initialization of Wide and Deep Neural Networks through Partial
Jacobians: General Theory and Applications [6.579523168465526]
We introduce \emph{partial Jacobians} of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0 \leq l$.
We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections.
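As a rough numerical illustration of such a recurrence (layer sizes, initialization scale, and the tanh activation are all assumptions for this sketch, not the paper's setup), the partial Jacobian $J^{l_0,l} = \partial z_l / \partial z_{l_0}$ can be built up layer by layer and checked against finite differences:

```python
import numpy as np

# Sketch: for a network with preactivations z_{l+1} = W_l @ tanh(z_l),
# the partial Jacobian J^{l0,l} = d z_l / d z_{l0} obeys the recurrence
#   J^{l0, l+1} = W_l @ diag(tanh'(z_l)) @ J^{l0, l},   J^{l0, l0} = I.
rng = np.random.default_rng(1)
widths = [5, 5, 5, 5]                       # illustrative layer sizes
Ws = [rng.normal(size=(widths[i + 1], widths[i])) / np.sqrt(widths[i])
      for i in range(len(widths) - 1)]

zs = [rng.normal(size=widths[0])]           # preactivations, layer by layer
for W in Ws:
    zs.append(W @ np.tanh(zs[-1]))

l0 = 1                                      # differentiate w.r.t. layer l0
J = np.eye(widths[l0])
for l in range(l0, len(Ws)):                # apply the recurrence
    J = Ws[l] @ np.diag(1.0 - np.tanh(zs[l]) ** 2) @ J

def forward_from(z, l0):
    """Map preactivations at layer l0 to preactivations at the last layer."""
    for W in Ws[l0:]:
        z = W @ np.tanh(z)
    return z

# Finite-difference estimate of the first column of J, for comparison.
eps = 1e-6
e0 = eps * np.eye(widths[l0])[0]
fd_col = (forward_from(zs[l0] + e0, l0) - forward_from(zs[l0], l0)) / eps
```

The norms of such products of layer-wise factors are exactly the quantities whose recurrence relations the summary above refers to.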
arXiv Detail & Related papers (2021-11-23T20:31:42Z) - Measuring Model Complexity of Neural Networks with Curve Activation
Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L_1$ and $L_2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.