Backward Gradient Normalization in Deep Neural Networks
- URL: http://arxiv.org/abs/2106.09475v1
- Date: Thu, 17 Jun 2021 13:24:43 GMT
- Title: Backward Gradient Normalization in Deep Neural Networks
- Authors: Alejandro Cabana and Luis F. Lago-Fernández
- Abstract summary: We introduce a new technique for gradient normalization during neural network training.
The gradients are rescaled during the backward pass using normalization layers introduced at certain points within the network architecture.
Results on tests with very deep neural networks show that the new technique can effectively control the gradient norm.
- Score: 68.8204255655161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new technique for gradient normalization during neural network
training. The gradients are rescaled during the backward pass using
normalization layers introduced at certain points within the network
architecture. These normalization nodes do not affect forward activity
propagation, but modify the backpropagation equations to permit a well-scaled
gradient flow that reaches the deepest network layers without experiencing
vanishing or explosion. Results on tests with very deep neural networks show
that the new technique can effectively control the gradient norm, allowing
the update of weights in the deepest layers and improving network accuracy
under several experimental conditions.
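The abstract describes normalization nodes that act as the identity in the forward pass but rescale the gradient on the backward pass. The PyTorch sketch below illustrates that idea with a custom autograd function; the unit-L2-norm rescaling and the GradNormLayer wrapper are illustrative assumptions, not necessarily the paper's exact normalization rule.

```python
import torch


class BackwardGradNorm(torch.autograd.Function):
    """Identity in the forward pass; rescales the gradient in the backward pass.

    Sketch of the idea in the abstract: forward activity propagation is left
    untouched, while the backpropagated gradient is renormalized so that it
    reaches deeper layers without vanishing or exploding. The unit-L2-norm
    rescaling below is an assumption, not necessarily the paper's exact rule.
    """

    @staticmethod
    def forward(ctx, x):
        # Forward activations are unchanged (view_as keeps autograd bookkeeping clean).
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Rescale the incoming gradient to unit L2 norm before passing it on.
        return grad_output / (grad_output.norm() + 1e-12)


class GradNormLayer(torch.nn.Module):
    """Module wrapper so the node can be inserted at chosen points in a network."""

    def forward(self, x):
        return BackwardGradNorm.apply(x)


# Example: insert a normalization node after every block of a very deep MLP.
blocks = []
for _ in range(50):
    blocks += [torch.nn.Linear(128, 128), torch.nn.Tanh(), GradNormLayer()]
model = torch.nn.Sequential(*blocks, torch.nn.Linear(128, 10))
```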
Related papers
- Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks [0.2302001830524133]
We instantiate a regularized form of the gradient clipping algorithm and prove that it can converge to the global minima of deep neural network loss functions.
We present empirical evidence that our theoretically founded regularized gradient clipping algorithm is also competitive with state-of-the-art deep learning heuristics (see the sketch after this list).
arXiv Detail & Related papers (2024-04-12T17:37:42Z)
- Sensitivity-Based Layer Insertion for Residual and Feedforward Neural Networks [0.3831327965422187]
Training of neural networks requires tedious and often manual tuning of the network architecture.
We propose a systematic method to insert new layers during the training process, which eliminates the need to choose a fixed network size before training.
arXiv Detail & Related papers (2023-11-27T16:44:13Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Gradient-trained Weights in Wide Neural Networks Align Layerwise to Error-scaled Input Correlations [11.176824373696324]
We derive the layerwise weight dynamics of infinite-width neural networks with nonlinear activations trained by gradient descent.
We formulate backpropagation-free learning rules, named Align-zero and Align-ada, that theoretically achieve the same alignment as backpropagation.
arXiv Detail & Related papers (2021-06-15T21:56:38Z)
- Rethinking Skip Connection with Layer Normalization in Transformers and ResNets [49.87919454950763]
Skip connection is a widely-used technique to improve the performance of deep neural networks.
In this work, we investigate how the scale factors affect the effectiveness of the skip connection.
arXiv Detail & Related papers (2021-05-15T11:44:49Z)
- Optimization Theory for ReLU Neural Networks Trained with Normalization Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization layers changes the optimization landscape and can enable faster convergence.
arXiv Detail & Related papers (2020-06-11T23:55:54Z)
- Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks [107.77595511218429]
In this paper, we investigate the empirical Rademacher complexity related to intermediate layers of deep neural networks.
We propose a feature distortion method (Disout) for addressing the aforementioned problem.
The superiority of the proposed feature map distortion in producing deep neural networks with higher testing performance is analyzed and demonstrated.
arXiv Detail & Related papers (2020-02-23T13:59:13Z)
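For the regularized gradient clipping entry above, the following minimal sketch shows one plausible form of the idea: a standard clipped-gradient SGD step whose scaling factor is additionally bounded below by a constant delta, so the effective step size never collapses to zero. The specific rule and the values of gamma, delta, and lr are illustrative assumptions, not the cited paper's exact algorithm.

```python
import torch


def regularized_clip_step(params, lr=0.1, gamma=1.0, delta=0.01):
    """One SGD step with a clipped gradient whose scale is bounded below.

    Standard clipping uses scale = min(1, gamma / ||g||); the 'regularized'
    variant sketched here additionally lower-bounds the scale by delta.
    This is an illustrative assumption, not the cited paper's exact rule.
    """
    params = list(params)
    grads = [p.grad for p in params if p.grad is not None]
    # Global L2 norm of all gradients.
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    # Clipping factor in [delta, 1].
    scale = float(torch.clamp(gamma / (total_norm + 1e-12), min=delta, max=1.0))
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr * scale)
    return total_norm
```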
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.