Towards Training Without Depth Limits: Batch Normalization Without
Gradient Explosion
- URL: http://arxiv.org/abs/2310.02012v1
- Date: Tue, 3 Oct 2023 12:35:02 GMT
- Title: Towards Training Without Depth Limits: Batch Normalization Without
Gradient Explosion
- Authors: Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer,
Gunnar Rätsch, Hadi Daneshmand
- Abstract summary: We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We construct a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
- Score: 83.90492831583997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Normalization layers are one of the key building blocks for deep neural
networks. Several theoretical studies have shown that batch normalization
improves signal propagation by preventing the representations from becoming
collinear across the layers. However, mean-field analyses of batch
normalization also conclude that this benefit comes at the expense of exploding
gradients in depth. Motivated by these two aspects of batch normalization, in
this study we pose the following question: "Can a batch-normalized network keep
the optimal signal propagation properties, but avoid exploding gradients?" We
answer this question in the affirmative by giving a particular construction of
a Multi-Layer Perceptron (MLP) with linear activations and batch normalization
that provably has bounded gradients at any depth. Based on Weingarten calculus,
we develop a rigorous and non-asymptotic theory for this constructed MLP that
gives a precise characterization of forward signal propagation, while proving
that gradients remain bounded for linearly independent input samples, which
holds in most practical settings. Inspired by our theory, we also design an
activation shaping scheme that empirically achieves the same properties for
certain non-linear activations.
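To make the construction concrete, here is a minimal PyTorch sketch, assuming orthogonal weight initialization and a full-rank input batch (both illustrative choices, not necessarily the authors' exact setup), that stacks linear layers with batch normalization and reports the gradient norm with respect to the input at depth 200.

```python
import torch

torch.manual_seed(0)
n, d, depth = 64, 32, 200             # batch size, width, number of layers

# Linearly independent input samples (a full-rank batch), as the theory assumes.
x = torch.randn(n, d, requires_grad=True)

h = x
for _ in range(depth):
    q, _ = torch.linalg.qr(torch.randn(d, d))   # orthogonal weights (assumption)
    h = h @ q
    # Batch normalization without affine parameters: center and rescale each
    # feature across the batch; the activation is linear (identity).
    h = (h - h.mean(dim=0)) / (h.std(dim=0, unbiased=False) + 1e-8)

h.pow(2).mean().backward()
print(f"depth = {depth}, gradient norm w.r.t. input = {x.grad.norm():.4f}")
```

Whether the printed norm stays bounded as `depth` grows depends on matching the paper's precise construction; the script is only a testbed for the claim, not a verification of it.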
Related papers
- Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise (a minimal update-rule sketch follows this entry).
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
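As an illustration of the normalization and clipping operations discussed in this entry, the sketch below applies them to the global gradient of a parameter list. The function name, constants, and the choice to use the global norm are assumptions; the paper's accelerated variant differs in its details.

```python
import torch

@torch.no_grad()
def sgd_step(params, lr=1e-2, clip=1.0, eps=1e-8, normalize=True):
    """One SGD update from a normalized or clipped global gradient (sketch)."""
    grads = [p.grad for p in params if p.grad is not None]
    gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads)).item()
    if normalize:
        # Normalized SGD: the step always has length lr, so a single
        # heavy-tailed gradient spike cannot blow up the update.
        scale = 1.0 / (gnorm + eps)
    else:
        # Clipped SGD: leave small gradients alone, cap large ones at `clip`.
        scale = min(1.0, clip / (gnorm + eps))
    for p in params:
        if p.grad is not None:
            p.add_(p.grad, alpha=-lr * scale)
```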
- Taming Gradient Oversmoothing and Expansion in Graph Neural Networks [3.0764244780817283]
Oversmoothing has been claimed as a primary bottleneck for graph neural networks (GNNs).
We show the presence of $\textit{gradient oversmoothing}$ preventing optimization during training.
We provide a simple yet effective normalization method to prevent the gradient expansion.
arXiv Detail & Related papers (2024-10-07T08:22:20Z)
- Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs [30.003409099607204]
We provide a formal and precise characterization of (linearized) graph neural networks (GNNs) with residual connections and normalization layers.
We show that the centering step of a normalization layer alters the graph signal in message-passing in such a way that relevant information can become harder to extract.
We introduce a novel, principled normalization layer called GraphNormv2 in which the centering step is learned such that it does not distort the original graph signal in an undesirable way (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2024-06-05T06:53:16Z)
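The entry describes GraphNormv2 only at a high level. A plausible reading, sketched below, makes the centering step learnable via a per-feature coefficient on the subtracted mean, as in the original GraphNorm; the class name and exact parameterization are assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn

class LearnedCenterNorm(nn.Module):
    """Normalization with a learnable centering step (illustrative sketch)."""

    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(dim))   # learned centering weight
        self.gamma = nn.Parameter(torch.ones(dim))   # learned scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learned shift

    def forward(self, x):                            # x: (num_nodes, dim)
        mean = x.mean(dim=0)
        # With alpha -> 0 the layer learns to skip centering, so it need not
        # distort the graph signal when full centering would hurt.
        centered = x - self.alpha * mean
        var = centered.pow(2).mean(dim=0)
        return self.gamma * centered / torch.sqrt(var + self.eps) + self.beta
```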
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape, and show how linear interpolation can help by leveraging the theory of nonexpansive operators (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
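The linear-interpolation stabilizer admits a very short sketch: take a candidate step from any inner optimizer, then move only a fraction `lam` of the way toward it (a Krasnosel'skii-Mann-style averaged iteration, which is where the theory of nonexpansive operators enters). The names `interpolated_step`, `base_step`, and `lam` are illustrative.

```python
import torch

@torch.no_grad()
def interpolated_step(params, base_step, lam=0.5):
    """x_{t+1} = (1 - lam) * x_t + lam * base_step(x_t)  (sketch)."""
    snapshot = [p.clone() for p in params]
    base_step()                  # any in-place inner update, e.g. optimizer.step
    for p, p_old in zip(params, snapshot):
        # Averaging toward the previous iterate damps the oscillations that
        # raw steps can exhibit on nonmonotone loss landscapes.
        p.mul_(lam).add_(p_old, alpha=1.0 - lam)
```

Here `base_step` could be `optimizer.step` for any optimizer whose gradients are already populated.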
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Training invariances and the low-rank phenomenon: beyond linear networks [44.02161831977037]
We show that when one trains a deep linear network with logistic or exponential loss on linearly separable data, the weights converge to rank-$1$ matrices.
This is the first time such a low-rank phenomenon has been proven rigorously for nonlinear ReLU-activated feedforward networks.
Our proof relies on a specific decomposition of the network into a multilinear function and another ReLU network whose weights are constant under a certain parameter directional convergence.
arXiv Detail & Related papers (2022-01-28T07:31:19Z)
- Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via stochastic gradient descent (SGD).
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z)
- Directional convergence and alignment in deep learning [38.73942298289583]
We show that although the minimizers of cross-entropy and related classification losses are at infinity, network weights learned by gradient flow converge in direction.
This proof holds for deep homogeneous networks allowing for ReLU, max-pooling, linear, and convolutional layers.
arXiv Detail & Related papers (2020-06-11T17:50:11Z)
- Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs [115.35745188028169]
We extend conditioning analysis to deep neural networks (DNNs) in order to investigate their learning dynamics.
We show that batch normalization (BN) can stabilize training, but sometimes results in the false impression of a local minimum.
We experimentally observe that BN can improve the layer-wise conditioning of the optimization problem.
arXiv Detail & Related papers (2020-02-25T11:40:27Z)
- On the Principle of Least Symmetry Breaking in Shallow ReLU Models [13.760721677322072]
We show that the principle of least loss of symmetry with respect to the target weights may apply to a broader range of settings.
Motivated by this, we conduct a series of experiments that corroborate this hypothesis for different classes of non-isotropic, non-product distributions, smooth activation functions, and networks with a few layers.
arXiv Detail & Related papers (2019-12-26T22:04:41Z)