Stabilizing RNN Gradients through Pre-training
- URL: http://arxiv.org/abs/2308.12075v2
- Date: Fri, 5 Jan 2024 00:56:41 GMT
- Title: Stabilizing RNN Gradients through Pre-training
- Authors: Luca Herranz-Celotti, Jean Rouat
- Abstract summary: Theories of learning propose preventing the gradient from growing exponentially with depth or time, in order to stabilize and improve training.
We extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution.
We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient.
- Score: 3.335932527835653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous theories of learning propose to prevent the gradient from
exponential growth with depth or time, to stabilize and improve training.
Typically, these analyses are conducted on feed-forward fully-connected neural
networks or simple single-layer recurrent neural networks, given their
mathematical tractability. In contrast, this study demonstrates that
pre-training the network to local stability can be effective whenever the
architectures are too complex for an analytical initialization. Furthermore, we
extend known stability theories to encompass a broader family of deep recurrent
networks, requiring minimal assumptions on data and parameter distribution, a
theory we call the Local Stability Condition (LSC). Our investigation reveals
that the classical Glorot, He, and Orthogonal initialization schemes satisfy
the LSC when applied to feed-forward fully-connected neural networks. However,
analysing deep recurrent networks, we identify a new additive source of
exponential explosion that emerges from counting gradient paths in a
rectangular grid in depth and time. We propose a new approach to mitigate this
issue, which consists of giving a weight of one half to the time and depth
contributions to the gradient, instead of the classical weight of one. Our
empirical results confirm that pre-training both feed-forward and recurrent
networks, for differentiable, neuromorphic and state-space models to fulfill
the LSC, often results in improved final performance. This study contributes to
the field by providing a means to stabilize networks of any complexity. Our
approach can be implemented as an additional step before pre-training on large
augmented datasets, and as an alternative to finding stable initializations
analytically.
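The core idea of pre-training to local stability can be illustrated with a minimal NumPy sketch. The snippet below is an assumption-laden toy, not the paper's actual procedure: it uses a single tanh recurrence, estimates the spectral norm of the local recurrent Jacobian over random states, and rescales the recurrent matrix until that norm matches a target radius (1.0 for the classical condition, 0.5 under the half-weight scheme). The function names and the simple multiplicative rescaling are illustrative stand-ins for the paper's gradient-based pre-training.

```python
import numpy as np

rng = np.random.default_rng(0)

def jacobian_norm(W, h):
    """Spectral norm of the local Jacobian of h' = tanh(W @ h) w.r.t. h,
    which is diag(1 - tanh(W h)^2) @ W."""
    a = np.tanh(W @ h)
    J = (1.0 - a**2)[:, None] * W
    return np.linalg.norm(J, 2)  # largest singular value

def pretrain_to_radius(W, target, n_states=64, iters=200, lr=0.5):
    """Illustrative stand-in for stability pre-training: rescale W until the
    average local Jacobian spectral norm over random states hits `target`
    (1.0 classically; 0.5 under the half-weight idea for deep RNNs)."""
    d = W.shape[0]
    states = rng.standard_normal((n_states, d))
    for _ in range(iters):
        avg = np.mean([jacobian_norm(W, h) for h in states])
        W = W * (1.0 + lr * (target / avg - 1.0))
    return W

d = 16
W = rng.standard_normal((d, d)) / np.sqrt(d)  # a generic random init
W_half = pretrain_to_radius(W, target=0.5)
```

Because the tanh derivative itself depends on the scale of `W`, a closed-form rescaling is not available in general; iterating a small multiplicative correction, as above, is one simple way to reach the target radius numerically for architectures too complex for an analytical initialization.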
Related papers
- Implicit regularization of deep residual networks towards neural ODEs [8.075122862553359]
We establish an implicit regularization of deep residual networks towards neural ODEs.
We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training.
arXiv Detail & Related papers (2023-09-03T16:35:59Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, and for both we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient-based learning combined with nonconvexity renders training susceptible to novel problems.
We propose fusing neighboring layers of deeper networks that are trained with random variables.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.