On the Impact of Stable Ranks in Deep Nets
- URL: http://arxiv.org/abs/2110.02333v1
- Date: Tue, 5 Oct 2021 20:04:41 GMT
- Title: On the Impact of Stable Ranks in Deep Nets
- Authors: Bogdan Georgiev, Lukas Franken, Mayukh Mukherjee and Georgios
Arvanitidis
- Abstract summary: We show that stable ranks appear layerwise essentially as linear factors whose effect accumulates exponentially depthwise.
- Score: 3.307203784120635
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A recent line of work has established intriguing connections between the
generalization/compression properties of a deep neural network (DNN) model and
the so-called layer weights' stable ranks. Intuitively, the latter are
indicators of the effective number of parameters in the net. In this work, we
address some natural questions regarding the space of DNNs conditioned on the
layers' stable rank, where we study feed-forward dynamics, initialization,
training and expressivity. To this end, we first propose a random DNN model
with a new sampling scheme based on stable rank. Then, we show how feed-forward
maps are affected by the constraint and how training evolves in the
overparametrized regime (via Neural Tangent Kernels). Our results imply that
stable ranks appear layerwise essentially as linear factors whose effect
accumulates exponentially depthwise. Moreover, we provide empirical analysis
suggesting that stable rank initialization alone can lead to convergence speed
ups.
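The stable rank of a weight matrix W is the ratio ||W||_F^2 / ||W||_2^2, a robust surrogate for the rank that counts the effective number of significant singular directions. As a minimal, self-contained illustration (not the authors' code; the layer widths and Gaussian initialization are arbitrary choices for the example), the snippet below computes layerwise stable ranks for a random feed-forward net and the depthwise product suggested by the "linear factors accumulating exponentially" picture:

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2 (always <= rank(W))."""
    sv = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    return float(np.sum(sv ** 2) / sv[0] ** 2)

# Toy feed-forward net with arbitrary widths and 1/sqrt(fan-in) Gaussian init.
rng = np.random.default_rng(0)
widths = [256, 256, 256, 256, 256]
weights = [rng.standard_normal((m, n)) / np.sqrt(n)
           for n, m in zip(widths[:-1], widths[1:])]

ranks = [stable_rank(W) for W in weights]
print("layerwise stable ranks:", np.round(ranks, 2))
# Each layer contributes roughly one such factor, so their effect
# compounds multiplicatively (i.e. exponentially) with depth.
print("depthwise product:", np.prod(ranks))
```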
Related papers
- Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse [32.06666853127924]
Deep neural networks (DNNs) at convergence consistently represent the training data in the last layer via a symmetric geometric structure referred to as neural collapse.
In the unconstrained features model used in prior analyses, the features of the penultimate layer are free variables, which makes the model data-agnostic and, hence, puts into question its ability to capture DNN training.
We first prove generic guarantees on neural collapse that assume (i) low training error and balancedness of the linear layers, and (ii) bounded conditioning of the features before the linear part.
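For intuition, here is a minimal sketch (not this paper's code) of the within/between-class variability ratio commonly used to quantify neural collapse; the features and labels below are random stand-ins, and the trace ratio is a simple proxy for the usual NC1 metric:

```python
import numpy as np

def within_between_ratio(features: np.ndarray, labels: np.ndarray) -> float:
    """Trace of within-class scatter over between-class scatter (small => collapsed)."""
    global_mean = features.mean(axis=0)
    sw, sb = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        sw += np.sum((fc - mu_c) ** 2)
        sb += len(fc) * np.sum((mu_c - global_mean) ** 2)
    return sw / sb

# Stand-in penultimate-layer features for a 3-class problem.
rng = np.random.default_rng(0)
means = 5.0 * rng.standard_normal((3, 16))
labels = rng.integers(0, 3, size=300)
features = means[labels] + 0.1 * rng.standard_normal((300, 16))
print("within/between ratio:", within_between_ratio(features, labels))
```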
arXiv Detail & Related papers (2024-10-07T10:16:40Z)
- Stabilizing RNN Gradients through Pre-training [3.335932527835653]
Theories of learning propose preventing the gradient from growing exponentially with depth or time in order to stabilize and improve training.
We extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution.
We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient.
arXiv Detail & Related papers (2023-08-23T11:48:35Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
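For intuition, a minimal sketch of the implicit gradient step behind ISGD, applied to a toy stiff quadratic rather than a PINN loss (the loss, step size, and closed-form solve are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta with a stiff Hessian,
# standing in for a hard-to-train objective (illustrative only).
A = np.diag([1.0, 100.0])

def isgd_step(theta: np.ndarray, eta: float) -> np.ndarray:
    """Implicit step theta' = theta - eta * grad L(theta').
    For this quadratic it has the closed form (I + eta*A)^{-1} theta;
    for a general loss one would solve the nonlinear equation with an inner solver."""
    return np.linalg.solve(np.eye(len(theta)) + eta * A, theta)

theta = np.array([1.0, 1.0])
eta = 0.1                         # explicit GD would diverge here (eta > 2 / 100)
for _ in range(50):
    theta = isgd_step(theta, eta)
print("theta after 50 implicit steps:", theta)  # decays toward 0 despite the large step
```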
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- An Adaptive and Stability-Promoting Layerwise Training Approach for Sparse Deep Neural Network Architecture [0.0]
This work presents a two-stage adaptive framework for developing deep neural network (DNN) architectures that generalize well for a given training data set.
In the first stage, a layerwise training approach is adopted where a new layer is added each time and trained independently by freezing parameters in the previous layers.
We introduce an epsilon-delta stability-promoting concept as a desirable property for a learning algorithm and show that employing manifold regularization yields an epsilon-delta stability-promoting algorithm.
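A compressed PyTorch sketch of the greedy layerwise scheme described above, with earlier layers frozen while each new layer is trained; the data, widths, and optimizer are placeholders, and the paper's epsilon-delta stability test and manifold regularization are not reproduced here:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 20), torch.randn(512, 1)       # stand-in regression data

layers, width, in_dim = nn.ModuleList(), 32, X.shape[1]
for stage in range(3):                                  # add one hidden layer per stage
    new_layer = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU())
    head = nn.Linear(width, 1)                          # fresh output head each stage
    layers.append(new_layer)
    in_dim = width

    for old in layers[:-1]:                             # freeze previously trained layers
        for p in old.parameters():
            p.requires_grad_(False)

    opt = torch.optim.Adam(list(new_layer.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(200):                                # train only the new layer and head
        h = X
        for block in layers:
            h = block(h)
        loss = nn.functional.mse_loss(head(h), y)
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"stage {stage}: training loss {loss.item():.4f}")
```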
arXiv Detail & Related papers (2022-11-13T09:51:16Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z)
- An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
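As a toy illustration (not the paper's implementation), the following evolves a matrix with Cayley-transform steps generated by a skew-symmetric matrix, which keeps the iterates exactly on O(d); this norm preservation is the mechanism behind the stability claim. The dimension, step size, and fixed generator are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 0.1                                   # dimension and step size (arbitrary)
Q = np.eye(d)                                   # start on the orthogonal group O(d)
B = rng.standard_normal((d, d))
A = B - B.T                                     # skew-symmetric generator, A^T = -A

# Cayley step (I - h/2 A)^{-1} (I + h/2 A) is exactly orthogonal for skew-symmetric A.
# In ODEtoODE the generator itself evolves in time; it is kept fixed here for brevity.
cayley = np.linalg.solve(np.eye(d) - 0.5 * h * A, np.eye(d) + 0.5 * h * A)
for _ in range(100):
    Q = Q @ cayley

print("deviation from orthogonality:", np.linalg.norm(Q.T @ Q - np.eye(d)))
```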
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
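A small sketch of how tracking the Hessian norm can be done in practice, via power iteration on Hessian-vector products obtained from automatic differentiation (the toy model, data, and loss are placeholders, not the paper's estimator):

```python
import torch

def hessian_norm(loss, params, n_iters=20):
    """Estimate the spectral norm of the Hessian of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(n_iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / norm for x in v]
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)  # H @ v
        v = [h.detach() for h in hv]
    return torch.sqrt(sum((x * x).sum() for x in v)).item()

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
print("estimated Hessian spectral norm:", hessian_norm(loss, list(model.parameters())))
```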
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs [115.35745188028169]
We extend conditioning analysis to deep neural networks (DNNs) in order to investigate their learning dynamics.
We show that batch normalization (BN) can stabilize the training, but sometimes result in the false impression of a local minimum.
We experimentally observe that BN can improve the layer-wise conditioning of the optimization problem.
arXiv Detail & Related papers (2020-02-25T11:40:27Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)