Batch Normalization Orthogonalizes Representations in Deep Random
Networks
- URL: http://arxiv.org/abs/2106.03970v1
- Date: Mon, 7 Jun 2021 21:14:59 GMT
- Title: Batch Normalization Orthogonalizes Representations in Deep Random
Networks
- Authors: Hadi Daneshmand, Amir Joudaki, Francis Bach
- Abstract summary: We establish a non-asymptotic characterization of the interplay between depth, width, and the orthogonality of deep representations.
We prove that the deviation of the representations from orthogonality rapidly decays with depth up to a term inversely proportional to the network width.
This result has two main implications: 1) Theoretically, as the depth grows, the distribution of the representation contracts to a Wasserstein-2 ball around an isotropic Gaussian distribution.
- Score: 3.109481609083199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper underlines a subtle property of batch-normalization (BN):
Successive batch normalizations with random linear transformations make hidden
representations increasingly orthogonal across layers of a deep neural network.
We establish a non-asymptotic characterization of the interplay between depth,
width, and the orthogonality of deep representations. More precisely, under a
mild assumption, we prove that the deviation of the representations from
orthogonality rapidly decays with depth up to a term inversely proportional to
the network width. This result has two main implications: 1) Theoretically, as
the depth grows, the distribution of the representation -- after the linear
layers -- contracts to a Wasserstein-2 ball around an isotropic Gaussian
distribution. Furthermore, the radius of this Wasserstein ball shrinks with the
width of the network. 2) In practice, the orthogonality of the representations
directly influences the performance of stochastic gradient descent (SGD). When
representations are initially aligned, we observe that SGD wastes many iterations
orthogonalizing the representations before classification. Nevertheless, we
experimentally show that starting optimization from orthogonal representations
is sufficient to accelerate SGD, with no need for BN.
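To make the mechanism concrete, the following is a minimal NumPy sketch (not the authors' code) of the construction the abstract describes: a batch of nearly aligned inputs is pushed through a stack of random linear layers, each followed by batch normalization, while a simple orthogonality gap between the normalized batch Gram matrix and the identity is tracked. The width, batch size, depth, and the exact gap formula below are illustrative assumptions, not the paper's precise setting.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize each feature (row) to zero mean and unit variance across the batch."""
    mu = h.mean(axis=1, keepdims=True)
    var = h.var(axis=1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def orthogonality_gap(h):
    """Distance between the normalized batch Gram matrix and the scaled identity.

    Both matrices are rescaled to unit Frobenius norm, so the gap is 0 exactly
    when the n representations in the batch are orthogonal with equal norms.
    """
    n = h.shape[1]
    gram = h.T @ h
    return np.linalg.norm(gram / np.linalg.norm(gram) - np.eye(n) / np.sqrt(n))

rng = np.random.default_rng(0)
width, batch, depth = 256, 8, 60

# Start from a nearly aligned batch (rank one plus small noise) -- the hard case
# described in the abstract.
direction = rng.standard_normal((width, 1))
h = direction @ np.ones((1, batch)) + 1e-2 * rng.standard_normal((width, batch))

print(f"layer   0  gap {orthogonality_gap(h):.4f}")
for layer in range(1, depth + 1):
    w = rng.standard_normal((width, width)) / np.sqrt(width)  # random linear layer
    h = batch_norm(w @ h)                                      # BN after the linear map
    if layer % 10 == 0:
        print(f"layer {layer:3d}  gap {orthogonality_gap(h):.4f}")
```

Under these assumptions the printed gap should fall sharply over the first layers and then plateau at a level that shrinks as the width grows, mirroring the depth-versus-width trade-off stated in the abstract.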
Related papers
- Feature Learning and Generalization in Deep Networks with Orthogonal Weights [1.7956122940209063]
Deep neural networks with weights drawn from independent Gaussian distributions can be tuned to criticality.
These networks still exhibit fluctuations that grow linearly with the depth of the network.
We show analytically that rectangular networks with tanh activations and weights drawn from the ensemble of orthogonal matrices have corresponding preactivation fluctuations.
arXiv Detail & Related papers (2023-10-11T18:00:02Z) - Law of Balance and Stationary Distribution of Stochastic Gradient
Descent [11.937085301750288]
We prove that the minibatch noise of stochastic gradient descent (SGD) regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry.
We then derive the stationary distribution of gradient flow for a diagonal linear network with arbitrary depth and width.
These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
arXiv Detail & Related papers (2023-08-13T03:13:03Z) - Bayesian Interpolation with Deep Linear Networks [92.1721532941863]
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory.
We show that linear networks make provably optimal predictions at infinite depth.
We also show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth.
arXiv Detail & Related papers (2022-12-29T20:57:46Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Vanishing Curvature and the Power of Adaptive Methods in Randomly
Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep, randomly initialized neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth).
arXiv Detail & Related papers (2021-06-07T16:29:59Z) - A Convergence Theory Towards Practical Over-parameterized Deep Neural
Networks [56.084798078072396]
We take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time.
We show that convergence to a global minimum is guaranteed for networks whose width is quadratic in the sample size and linear in their depth, in time logarithmic in both.
Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size.
arXiv Detail & Related papers (2021-01-12T00:40:45Z) - On the Convex Behavior of Deep Neural Networks in Relation to the
Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient-descent optimization maneuvers through surfaces of positive curvature at the start and end of training, and of close-to-zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the Gauss-Newton component G (spelled out in the short note after this list).
arXiv Detail & Related papers (2020-01-14T16:30:01Z)
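For reference, the "component G" mentioned in the last entry above is the Gauss-Newton term of the standard decomposition of the loss Hessian; the identity below is a well-known fact spelled out here for clarity, not quoted from that paper.

```latex
% Gauss--Newton split of the Hessian of L(\theta) = \sum_i \ell(f(x_i;\theta))
% for a scalar-output network f; "component G" is the first term, which is
% positive semi-definite whenever the loss \ell is convex.
\nabla^2_\theta L(\theta)
  = \underbrace{\sum_i \ell''\big(f(x_i;\theta)\big)\,
        \nabla_\theta f(x_i;\theta)\, \nabla_\theta f(x_i;\theta)^{\top}}_{G}
  + \underbrace{\sum_i \ell'\big(f(x_i;\theta)\big)\,
        \nabla^2_\theta f(x_i;\theta)}_{\text{residual term}}
```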