On the impact of activation and normalization in obtaining isometric
embeddings at initialization
- URL: http://arxiv.org/abs/2305.18399v3
- Date: Fri, 17 Nov 2023 22:14:18 GMT
- Title: On the impact of activation and normalization in obtaining isometric
embeddings at initialization
- Authors: Amir Joudaki, Hadi Daneshmand, Francis Bach
- Abstract summary: We show that layer normalization biases the Gram matrix of a multilayer perceptron towards the identity matrix.
We quantify this rate using the Hermite expansion of the activation function.
- Score: 3.3637738618247157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we explore the structure of the penultimate Gram matrix in
deep neural networks, which contains the pairwise inner products of outputs
corresponding to a batch of inputs. In several architectures it has been
observed that this Gram matrix becomes degenerate with depth at initialization,
which dramatically slows training. Normalization layers, such as batch or layer
normalization, play a pivotal role in preventing the rank collapse issue.
Despite promising advances, the existing theoretical results do not extend to
layer normalization, which is widely used in transformers, and cannot
quantitatively characterize the role of non-linear activations. To bridge this
gap, we prove that layer normalization, in conjunction with activation layers,
biases the Gram matrix of a multilayer perceptron towards the identity matrix
at an exponential rate with depth at initialization. We quantify this rate
using the Hermite expansion of the activation function.
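The claimed dynamic is easy to probe numerically. The sketch below is an illustrative simulation, not the authors' code; the width, depth, and ReLU activation are my own choices. It builds a random MLP with layer normalization after each ReLU layer and tracks the Gram matrix of a highly correlated input batch: layer normalization pins the diagonal to one, while the off-diagonal entries decay toward zero with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, depth = 4, 512, 50  # batch size, width, depth (illustrative choices)

def layer_norm(h):
    """Normalize each sample (row) to zero mean and unit variance across features."""
    h = h - h.mean(axis=1, keepdims=True)
    return h / h.std(axis=1, keepdims=True)

def gram(h):
    """Pairwise inner products of the batch, scaled so layer norm pins the diagonal to 1."""
    return (h @ h.T) / h.shape[1]

# A nearly rank-one batch: all inputs share a common direction, so the
# initial Gram matrix is far from the identity.
base = rng.standard_normal(d)
x = layer_norm(base + 0.05 * rng.standard_normal((n, d)))
g0 = gram(x)

h = x
for _ in range(depth):
    w = rng.standard_normal((d, d)) / np.sqrt(d)  # standard Gaussian initialization
    h = layer_norm(np.maximum(h @ w, 0.0))        # ReLU followed by layer norm

g = gram(h)
off0 = np.abs(g0 - np.eye(n)).max()  # initial off-diagonal mass (close to 1)
off = np.abs(g - np.eye(n)).max()    # residual off-diagonal mass after `depth` layers
```

The decay rate of the off-diagonal entries is what the paper quantifies via the Hermite coefficients of the (centered) activation; the simulation only shows that the decay happens.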
Related papers
- Implicit Regularization of Gradient Flow on One-Layer Softmax Attention [10.060496091806694]
We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model.
Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices.
arXiv Detail & Related papers (2024-03-13T17:02:27Z)
- Linearly Constrained Weights: Reducing Activation Shift for Faster Training of Neural Networks [1.7767466724342067]
We propose linearly constrained weights (LCW) to reduce the activation shift in both fully connected and convolutional layers.
LCW enables a deep feedforward network with sigmoid activation functions to be trained efficiently by resolving the vanishing gradient problem.
arXiv Detail & Related papers (2024-03-08T01:01:24Z)
- Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal-propagation properties while avoiding gradient explosion with depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
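That signal-propagation behavior can be visualized with a toy simulation. The sketch below is my own, not the authors' code, and the width, depth, and near-collinear input batch are assumptions. It alternates random linear layers with batch normalization (each feature standardized across the batch) and checks that the batch is driven toward a near-orthogonal, well-conditioned configuration rather than collapsing.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, depth = 4, 512, 100  # batch size, width, depth (illustrative choices)

def batch_norm(h):
    """Standardize each feature (column) across the batch: zero mean, unit variance."""
    h = h - h.mean(axis=0, keepdims=True)
    return h / h.std(axis=0, keepdims=True)

# A nearly collinear batch: without normalization, depth would not repair this.
base = rng.standard_normal(d)
h = base + 0.05 * rng.standard_normal((n, d))

for _ in range(depth):
    w = rng.standard_normal((d, d)) / np.sqrt(d)  # linear activation: no nonlinearity
    h = batch_norm(h @ w)

# Batch norm fixes the total scale exactly: every column has unit variance,
# so the squared Frobenius norm is n * d at any depth -- no explosion, no vanishing.
total_energy = (h ** 2).sum()

# Cosine similarities between distinct samples: small values mean the batch
# has moved toward an orthogonal (isometric) configuration.
hn = h / np.linalg.norm(h, axis=1, keepdims=True)
c = hn @ hn.T
max_offdiag = np.abs(c - np.eye(n)).max()
```

The exact `total_energy` value follows directly from the normalization; the off-diagonal decay is the empirical counterpart of the signal-propagation claim.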
arXiv Detail & Related papers (2023-10-03T12:35:02Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language domains.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies, which combine the representations from different layers adaptively to make the output more diverse.
arXiv Detail & Related papers (2022-02-17T12:20:52Z)
- Extreme Memorization via Scale of Initialization [72.78162454173803]
We construct an experimental setup in which changing the scale of initialization strongly impacts the implicit regularization induced by SGD.
We find that the extent and manner in which generalization ability is affected depends on the activation and loss function used.
In the case of the homogeneous ReLU activation, we show that this behavior can be attributed to the loss function.
arXiv Detail & Related papers (2020-08-31T04:53:11Z)
- Generative Flows with Matrix Exponential [25.888286821451562]
Generative flow models enjoy tractable exact likelihood and efficient sampling.
We incorporate the matrix exponential into generative flows.
Our model achieves strong density-estimation performance among generative flow models.
arXiv Detail & Related papers (2020-07-19T11:18:47Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent moves through surfaces of positive curvature at the start and end of training, and close-to-zero curvature in between.
In other words, during crucial parts of the training process, the Hessian in wide networks appears to be dominated by the component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.