Vanishing Curvature and the Power of Adaptive Methods in Randomly
Initialized Deep Networks
- URL: http://arxiv.org/abs/2106.03763v1
- Date: Mon, 7 Jun 2021 16:29:59 GMT
- Title: Vanishing Curvature and the Power of Adaptive Methods in Randomly
Initialized Deep Networks
- Authors: Antonio Orvieto, Jonas Kohler, Dario Pavllo, Thomas Hofmann, Aurelien
Lucchi
- Abstract summary: This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep randomly initialized neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth).
- Score: 30.467121747150816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper revisits the so-called vanishing gradient phenomenon, which
commonly occurs in deep randomly initialized neural networks. Leveraging an
in-depth analysis of neural chains, we first show that vanishing gradients
cannot be circumvented when the network width scales with less than O(depth),
even when initialized with the popular Xavier and He initializations. Second,
we extend the analysis to second-order derivatives and show that random i.i.d.
initialization also gives rise to Hessian matrices with eigenspectra that
vanish as networks grow in depth. Whenever this happens, optimizers are
initialized in a very flat, saddle point-like plateau, which is particularly
hard to escape with stochastic gradient descent (SGD) as its escaping time is
inversely related to curvature. We believe that this observation is crucial for
fully understanding (a) historical difficulties of training deep nets with
vanilla SGD, (b) the success of adaptive gradient methods (which naturally
adapt to curvature and thus quickly escape flat plateaus) and (c) the
effectiveness of modern architectural components like residual connections and
normalization layers.
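As a rough, hand-rolled illustration of the mechanism described above (not code from the paper), the sketch below builds narrow tanh MLPs of increasing depth under Xavier initialization and measures, at the random starting point, the gradient norm together with an estimate of the largest Hessian eigenvalue obtained by power iteration on Hessian-vector products. The depths, width, activation, loss, and synthetic data are arbitrary choices made for the demonstration.

```python
# Minimal sketch (assumed setup, not the paper's experiments): measure the
# gradient norm and the top Hessian eigenvalue of a narrow MLP at random
# initialization, for increasing depth.
import torch
import torch.nn as nn

def make_mlp(depth, width, d_in=16, d_out=1):
    layers, d = [], d_in
    for _ in range(depth):
        lin = nn.Linear(d, width)
        nn.init.xavier_normal_(lin.weight)
        nn.init.zeros_(lin.bias)
        layers += [lin, nn.Tanh()]
        d = width
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

def grad_norm_and_top_eig(model, x, y, iters=20):
    params = list(model.parameters())
    loss = nn.functional.mse_loss(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    # Power iteration with Hessian-vector products to estimate lambda_max.
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    rayleigh = sum((h * u).sum() for h, u in zip(hv, v))  # v^T H v with ||v|| = 1
    return gnorm.item(), rayleigh.item()

torch.manual_seed(0)
x, y = torch.randn(128, 16), torch.randn(128, 1)
for depth in (5, 20, 50):
    g, lam = grad_norm_and_top_eig(make_mlp(depth, width=16), x, y)
    print(f"depth={depth:3d}  ||grad||={g:.2e}  lambda_max~{lam:.2e}")
```

In line with the abstract, both printed quantities are expected to shrink toward zero as depth grows at fixed width, which is exactly the flat, low-curvature plateau that makes plain SGD slow to escape.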
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Implicit Sparse Regularization: The Impact of Depth and Early Stopping [35.4113861165802]
We show that early stopping is crucial for gradient descent to converge to a sparse model.
We characterize the impact of depth and early stopping and show that for a general depth parameter N, gradient descent with early stopping achieves minimax optimal sparse recovery.
arXiv Detail & Related papers (2021-08-12T07:43:29Z)
- Continuous vs. Discrete Optimization of Deep Neural Networks [15.508460240818575]
We show that over deep neural networks with homogeneous activations, gradient flow trajectories enjoy favorable curvature.
This finding allows us to translate an analysis of gradient flow over deep linear neural networks into a guarantee that gradient descent efficiently converges to a global minimum.
We hypothesize that the theory of gradient flows will be central to unraveling mysteries behind deep learning.
arXiv Detail & Related papers (2021-07-14T10:59:57Z)
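The entry above compares gradient flow with gradient descent on deep (in particular linear) networks. The sketch below is an illustrative-only numerical version of that comparison: a small deep linear network, a near-identity initialization, and two step sizes, one practical and one fine enough to act as a stand-in for the continuous flow. None of these choices come from the paper.

```python
# Illustrative-only comparison: gradient descent with a moderate step size
# versus a finely discretized Euler scheme standing in for gradient flow,
# on a small deep linear network.
import torch

torch.manual_seed(0)
depth, dim, n = 4, 10, 30
X = torch.randn(n, dim)
y = X @ torch.randn(dim, 1)

# Shared initialization: near-identity hidden layers, small output layer.
Ws0 = [torch.eye(dim) + 0.01 * torch.randn(dim, dim) for _ in range(depth - 1)] \
      + [0.01 * torch.randn(dim, 1)]

def loss(Ws):
    out = X
    for W in Ws:
        out = out @ W
    return ((out - y) ** 2).mean()

def run(step, total_time):
    Ws = [W.clone().requires_grad_(True) for W in Ws0]
    for _ in range(round(total_time / step)):
        grads = torch.autograd.grad(loss(Ws), Ws)
        with torch.no_grad():
            for W, g in zip(Ws, grads):
                W -= step * g
    return loss(Ws).item()

T = 5.0
print("gradient descent, eta=0.01   :", run(0.01, T))
print("gradient flow proxy, eta=5e-4:", run(5e-4, T))
```

If the two printed losses are close, gradient descent with the larger step has effectively tracked the (discretized) gradient flow trajectory over the same time horizon.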
- On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z)
- On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent [55.96478231566129]
We show that relative scales play an important role in determining the learned model.
We develop a technique for deriving the inductive bias of gradient-flow.
arXiv Detail & Related papers (2021-02-19T07:10:48Z)
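The entry above argues that the shape and scale of the initialization determine which model gradient-based training selects. A standard way to see a scale effect numerically, not taken from the paper, is a diagonal linear parameterization w = u*u - v*v fitted to sparse regression data: a small initialization scale tends toward a sparse interpolant, a large one toward a dense interpolant. Problem sizes, step size, and iteration count below are arbitrary.

```python
# Illustrative-only demo (not the paper's setting): in the diagonal linear
# parameterization w = u*u - v*v, the initialization scale alpha decides
# which interpolating solution gradient descent finds on sparse data.
import torch

torch.manual_seed(0)
n, d, k = 20, 40, 3
X = torch.randn(n, d)
w_star = torch.zeros(d)
w_star[:k] = 1.0                      # sparse ground truth
y = X @ w_star

def train(alpha, steps=20000, lr=5e-3):
    u = torch.full((d,), alpha).requires_grad_(True)
    v = torch.full((d,), alpha).requires_grad_(True)
    for _ in range(steps):
        w = u * u - v * v
        loss = ((X @ w - y) ** 2).mean()
        gu, gv = torch.autograd.grad(loss, (u, v))
        with torch.no_grad():
            u -= lr * gu
            v -= lr * gv
    return (u * u - v * v).detach()

for alpha in (0.01, 1.0):
    w = train(alpha)
    print(f"alpha={alpha}: off-support max |w_i| = {w[k:].abs().max().item():.3f}, "
          f"recovery error = {(w - w_star).norm().item():.3f}")
```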
- Inductive Bias of Gradient Descent for Exponentially Weight Normalized Smooth Homogeneous Neural Nets [1.7259824817932292]
We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss.
This paper shows that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate.
arXiv Detail & Related papers (2020-10-24T14:34:56Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
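The entry above reports that tracking the Hessian norm is a useful diagnostic; the paper derives and controls this quantity analytically. The sketch below only shows one generic way to monitor a Hessian Frobenius norm numerically during training, via the identity E||Hv||^2 = ||H||_F^2 for Gaussian probes v and Hessian-vector products. The model, data, optimizer, and probe count are placeholders, not the paper's setup.

```python
# Illustrative-only diagnostic: estimate the Frobenius norm of the loss
# Hessian with random probes, using E||Hv||^2 = ||H||_F^2 for v ~ N(0, I).
import torch
import torch.nn as nn

def hessian_fro_estimate(loss, params, n_probes=10):
    grads = torch.autograd.grad(loss, params, create_graph=True)
    acc = 0.0
    for _ in range(n_probes):
        v = [torch.randn_like(p) for p in params]
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        acc += sum((h ** 2).sum() for h in hv).item()
    return (acc / n_probes) ** 0.5

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 1))
x, y = torch.randn(256, 20), torch.randn(256, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(201):
    loss = nn.functional.mse_loss(model(x), y)
    if step % 50 == 0:
        params = list(model.parameters())
        print(f"step {step:3d}  loss {loss.item():.3f}  "
              f"||H||_F ~ {hessian_fro_estimate(loss, params):.3f}")
    opt.zero_grad()
    loss.backward()
    opt.step()
```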
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
- On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and close-to-zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)