Training Deep Neural Networks Without Batch Normalization
- URL: http://arxiv.org/abs/2008.07970v1
- Date: Tue, 18 Aug 2020 15:04:40 GMT
- Title: Training Deep Neural Networks Without Batch Normalization
- Authors: Divya Gaur, Joachim Folz, and Andreas Dengel
- Abstract summary: This work studies batch normalization in detail, while comparing it with other methods such as weight normalization, gradient clipping and dropout.
The main purpose of this work is to determine whether networks can be trained effectively without batch normalization by adapting the training process.
- Score: 4.266320191208303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training neural networks is an optimization problem, and finding a decent set
of parameters through gradient descent can be a difficult task. A host of
techniques has been developed to aid this process before and during the
training phase. One of the most important and widely used classes of methods is
normalization. It is generally favorable for neurons to receive inputs that are
distributed with zero mean and unit variance, so we use statistics of the
dataset to normalize the inputs before the first layer. However, this property cannot
be guaranteed for the intermediate activations inside the network. A widely
used method to enforce this property inside the network is batch normalization.
It was developed to combat covariate shift inside networks. Empirically it is
known to work, but there is a lack of theoretical understanding about its
effectiveness and potential drawbacks it might have when used in practice. This
work studies batch normalization in detail, while comparing it with other
methods such as weight normalization, gradient clipping and dropout. The main
purpose of this work is to determine whether networks can be trained
effectively without batch normalization by adapting the training process.
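For context, the batch-normalization transform discussed in the abstract, and the alternatives it is compared against (weight normalization, gradient clipping, dropout), can be sketched in a few lines of PyTorch. This is an illustrative sketch under assumed layer sizes and hyperparameters, not the authors' experimental setup.

```python
# Illustrative sketch (not the paper's exact setup): batch normalization
# alongside the alternatives named in the abstract.
import torch
import torch.nn as nn

# Batch normalization: normalize each channel using batch statistics,
# then apply a learned affine transform.
bn_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),          # zero mean / unit variance per channel
    nn.ReLU(),
)

# Alternatives compared against in the paper:
wn_conv = nn.utils.weight_norm(      # weight normalization: reparameterize
    nn.Conv2d(3, 64, 3, padding=1))  # weights as direction * magnitude
drop = nn.Dropout(p=0.5)             # dropout: randomly zero activations

x = torch.randn(8, 3, 32, 32)        # dummy batch with assumed shape
loss = bn_block(x).mean()
loss.backward()

# Gradient clipping: bound the global gradient norm before the update.
nn.utils.clip_grad_norm_(bn_block.parameters(), max_norm=1.0)
```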
Related papers
- Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Training Thinner and Deeper Neural Networks: Jumpstart Regularization [2.8348950186890467]
We use regularization to prevent neurons from dying or becoming linear.
In comparison to conventional training, we obtain neural networks that are thinner, deeper, and - most importantly - more parameter-efficient.
arXiv Detail & Related papers (2022-01-30T12:11:24Z) - Compare Where It Matters: Using Layer-Wise Regularization To Improve
Federated Learning on Heterogeneous Data [0.0]
Federated Learning is a widely adopted method to train neural networks over distributed data.
One main limitation is the performance degradation that occurs when data is heterogeneously distributed.
We present FedCKA: a framework that outperforms previous state-of-the-art methods on various deep learning tasks.
arXiv Detail & Related papers (2021-12-01T10:46:13Z) - Distribution Mismatch Correction for Improved Robustness in Deep Neural
Networks [86.42889611784855]
Normalization methods increase a network's vulnerability to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
arXiv Detail & Related papers (2021-10-05T11:36:25Z) - Comparing Normalization Methods for Limited Batch Size Segmentation
Neural Networks [0.0]
Batch Normalization works best with a large batch size during training.
We show the effectiveness of Instance Normalization in the limited batch size neural network training environment.
We also show that the Instance Normalization implementation used in this experiment is efficient in computation time when compared to a network without any normalization method.
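A minimal comparison, under assumed tensor shapes (not the paper's segmentation networks), of how instance normalization removes the dependence on batch statistics:

```python
# Sketch (assumed shapes, not the paper's code): instance normalization
# computes statistics per sample and per channel, so it does not depend on
# the batch size the way batch normalization does.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)        # batch size of 1, as in limited-batch training

batch_norm = nn.BatchNorm2d(16)        # statistics over the whole (tiny) batch
instance_norm = nn.InstanceNorm2d(16)  # statistics per individual sample

y_bn = batch_norm(x)                   # unreliable estimates with batch size 1
y_in = instance_norm(x)                # unaffected by batch size
```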
arXiv Detail & Related papers (2020-11-23T17:13:24Z) - Weight and Gradient Centralization in Deep Neural Networks [13.481518628796692]
Batch normalization is currently the most widely used variant of internal normalization for deep neural networks.
In this work, we combine several of these methods and thereby increase the generalization of the networks.
arXiv Detail & Related papers (2020-10-02T08:50:04Z) - Optimization Theory for ReLU Neural Networks Trained with Normalization
Layers [82.61117235807606]
The success of deep neural networks is in part due to the use of normalization layers.
Our analysis shows how the introduction of normalization changes the optimization landscape and can enable faster convergence.
arXiv Detail & Related papers (2020-06-11T23:55:54Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
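As a generic illustration of curvature tracking (not the estimator proposed in the paper), the spectral norm of the loss Hessian can be approximated with power iteration on Hessian-vector products:

```python
# Generic sketch (not the paper's method): estimate the largest Hessian
# eigenvalue via power iteration on Hessian-vector products, a common way
# to track curvature during training.
import torch

def hessian_spectral_norm(loss, params, iters=20):
    # First-order gradients with a retained graph, so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        # Hessian-vector product: differentiate <grad, v> w.r.t. the parameters.
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    return norm.item()  # approximate top Hessian eigenvalue (curvature)

# Usage: lam = hessian_spectral_norm(loss, list(model.parameters()))
```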
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - Gradient Centralization: A New Optimization Technique for Deep Neural
Networks [74.935141515523]
Gradient centralization (GC) operates directly on gradients by centralizing the gradient vectors to have zero mean.
GC can be viewed as a projected gradient descent method with a constrained loss function.
GC is very simple to implement and can be easily embedded into existing gradient based DNNs with only one line of code.
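Since the summary notes that GC amounts to a one-line change, a minimal sketch of that operation is given below, assuming a PyTorch-style training loop; it illustrates the idea rather than reproducing the authors' released implementation.

```python
# Minimal sketch of gradient centralization (GC), assuming a PyTorch-style
# training loop; not the authors' released code.
import torch

def centralize_gradients(model: torch.nn.Module) -> None:
    """Subtract the mean from each weight gradient so it has zero mean."""
    for p in model.parameters():
        if p.grad is not None and p.grad.dim() > 1:  # weight matrices / conv filters
            p.grad -= p.grad.mean(dim=tuple(range(1, p.grad.dim())), keepdim=True)

# Usage inside a training step:
#   loss.backward(); centralize_gradients(model); optimizer.step()
```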
arXiv Detail & Related papers (2020-04-03T10:25:00Z) - Side-Tuning: A Baseline for Network Adaptation via Additive Side
Networks [95.51368472949308]
Adaptation can be useful in cases when training data is scarce, or when one wishes to encode priors in the network.
In this paper, we propose a straightforward alternative: side-tuning.
arXiv Detail & Related papers (2019-12-31T18:52:32Z)