On the Impact of Stable Ranks in Deep Nets
- URL: http://arxiv.org/abs/2110.02333v1
- Date: Tue, 5 Oct 2021 20:04:41 GMT
- Title: On the Impact of Stable Ranks in Deep Nets
- Authors: Bogdan Georgiev, Lukas Franken, Mayukh Mukherjee and Georgios
Arvanitidis
- Abstract summary: We show that stable ranks appear layerwise essentially as linear factors whose effect accumulates exponentially depthwise.
- Score: 3.307203784120635
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A recent line of work has established intriguing connections between the
generalization/compression properties of a deep neural network (DNN) model and
the so-called layer weights' stable ranks. Intuitively, the latter are
indicators of the effective number of parameters in the net. In this work, we
address some natural questions regarding the space of DNNs conditioned on the
layers' stable rank, where we study feed-forward dynamics, initialization,
training and expressivity. To this end, we first propose a random DNN model
with a new sampling scheme based on stable rank. Then, we show how feed-forward
maps are affected by the constraint and how training evolves in the
overparametrized regime (via Neural Tangent Kernels). Our results imply that
stable ranks appear layerwise essentially as linear factors whose effect
accumulates exponentially depthwise. Moreover, we provide empirical analysis
suggesting that stable rank initialization alone can lead to convergence speed
ups.
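To make the central quantity concrete, here is a minimal sketch of the stable rank of a weight matrix. It assumes the standard definition (squared Frobenius norm divided by squared spectral norm). The function name `stable_rank` and the example matrices are illustrative choices, not taken from the paper, and the paper's sampling scheme is not reproduced here.

```python
import numpy as np


def stable_rank(weight: np.ndarray) -> float:
    """Stable rank under the standard definition: ||W||_F^2 / ||W||_2^2.

    Assumes `weight` is a non-zero 2-D matrix; a zero matrix would divide by zero.
    """
    # Singular values are returned in descending order, so the first entry
    # is the spectral norm and the sum of squares is the squared Frobenius norm.
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return float(np.sum(singular_values**2) / singular_values[0] ** 2)


rng = np.random.default_rng(0)
dense = rng.standard_normal((64, 64))  # generic Gaussian layer
rank_one = np.outer(rng.standard_normal(64), rng.standard_normal(64))

print(stable_rank(np.eye(8)))  # identity: every singular value equals 1
print(stable_rank(rank_one))   # a single dominant direction
print(stable_rank(dense))      # somewhere between 1 and 64
```

The stable rank never exceeds the ordinary rank and, unlike it, changes continuously under small perturbations of the weights, which is why it can serve as a soft indicator of the effective number of parameters.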
Related papers
- Stabilizing RNN Gradients through Pre-training [3.335932527835653]
Learning theory proposes preventing the gradient from growing exponentially with depth or time in order to stabilize and improve training.
We extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution.
We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient.
arXiv Detail & Related papers (2023-08-23T11:48:35Z)
- On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations [0.0]
We investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation.
We show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized.
arXiv Detail & Related papers (2023-05-17T02:26:34Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose employing the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better the implicit regularization can operate and find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- Shallow Univariate ReLu Networks as Splines: Initialization, Loss Surface, Hessian, & Gradient Flow Dynamics [1.5393457051344297]
We propose reparametrizing ReLU NNs as continuous piecewise linear splines.
We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum.
Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2.
arXiv Detail & Related papers (2020-08-04T19:19:49Z)
- Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z)
- An Ode to an ODE [78.97367880223254]
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the group O(d).
This nested system of two flows provides stability and effectiveness of training and provably solves the gradient vanishing-explosion problem.
arXiv Detail & Related papers (2020-06-19T22:05:19Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs [115.35745188028169]
We extend conditioning analysis to deep neural networks (DNNs) in order to investigate their learning dynamics.
We show that batch normalization (BN) can stabilize training, but sometimes results in the false impression of a local minimum.
We experimentally observe that BN can improve the layer-wise conditioning of the optimization problem.
arXiv Detail & Related papers (2020-02-25T11:40:27Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.