Delving into Variance Transmission and Normalization: Shift of Average
Gradient Makes the Network Collapse
- URL: http://arxiv.org/abs/2103.11590v1
- Date: Mon, 22 Mar 2021 05:40:46 GMT
- Title: Delving into Variance Transmission and Normalization: Shift of Average
Gradient Makes the Network Collapse
- Authors: Yuxiang Liu, Jidong Ge, Chuanyi Li, and Jie Gui
- Abstract summary: We explain the effect of Batch Normalization (BN) from the perspective of variance transmission.
We propose Parametric Weights Standardization (PWS) to solve the shift of the average gradient.
PWS enables the network to converge fast without normalizing the outputs.
- Score: 9.848051975417116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Normalization operations are essential for state-of-the-art neural networks
and enable us to train a network from scratch with a large learning rate (LR).
We attempt to explain the real effect of Batch Normalization (BN) from the
perspective of variance transmission by investigating the relationship between
BN and Weights Normalization (WN). In this work, we demonstrate that the
problem of the shift of the average gradient will amplify the variance of every
convolutional (conv) layer. We propose Parametric Weights Standardization
(PWS), a fast module for conv filters that is robust to mini-batch size, to
solve the shift of the average gradient. PWS can provide the same speed-up as
BN. Moreover, it requires less computation and does not change the output of a
conv layer. PWS enables the network to converge fast without normalizing the
outputs. This result strengthens the case for the shift of the average
gradient and explains why BN works from the perspective of variance
transmission. The code and appendix will be made available on
https://github.com/lyxzzz/PWSConv.
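As a rough illustration of the idea, the sketch below standardizes each conv filter (zero mean, unit variance over its fan-in) and rescales it with a learnable per-filter parameter, leaving the conv outputs untouched. The class name PWSConv2d, the per-filter scale alpha, and the epsilon handling are assumptions made for illustration; the authors' actual module is in the repository linked above.

```python
# Hedged sketch of a "parametric weight standardization" conv layer.
# The learnable per-filter scale `alpha` and the epsilon handling are
# illustrative assumptions, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PWSConv2d(nn.Conv2d):
    def __init__(self, in_ch, out_ch, kernel_size, eps=1e-5, **kwargs):
        super().__init__(in_ch, out_ch, kernel_size, **kwargs)
        self.eps = eps
        # One learnable scale per output filter (illustrative choice).
        self.alpha = nn.Parameter(torch.ones(out_ch, 1, 1, 1))

    def forward(self, x):
        w = self.weight
        # Standardize each filter over its fan-in: zero mean, unit variance.
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = w.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        w_hat = self.alpha * (w - mean) / torch.sqrt(var + self.eps)
        # Only the filters are standardized; the outputs are not normalized.
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Usage: drop-in replacement for nn.Conv2d, with no BN layer after it.
layer = PWSConv2d(3, 16, kernel_size=3, padding=1)
out = layer(torch.randn(8, 3, 32, 32))
```

Because only the filters are rescaled, such a layer is a drop-in replacement for a standard conv and its cost does not depend on the mini-batch size.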
Related papers
- Revisiting Data Augmentation for Rotational Invariance in Convolutional
Neural Networks [0.29127054707887967]
We investigate how best to include rotational invariance in a CNN for image classification.
Our experiments show that networks trained with data augmentation alone can classify rotated images nearly as well as in the normal unrotated case.
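For concreteness, a data-augmentation baseline of this kind can be set up in a few lines; the dataset and degree range below are placeholders, not the authors' exact protocol.

```python
# Minimal sketch: random rotations as the only rotation-specific measure
# during training. Dataset choice and transform values are illustrative.
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

train_tf = T.Compose([
    T.RandomRotation(degrees=180),  # sample an angle uniformly in [-180, 180]
    T.ToTensor(),
])
train_set = CIFAR10(root="./data", train=True, download=True, transform=train_tf)
```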
arXiv Detail & Related papers (2023-10-12T15:53:24Z)
- Batch Layer Normalization, A new normalization layer for CNNs and RNN [0.0]
This study introduces a new normalization layer termed Batch Layer Normalization (BLN)
As a combined version of batch and layer normalization, BLN adaptively puts appropriate weight on mini-batch and feature normalization based on the inverse size of mini-batches.
Test results indicate the practical potential of BLN and show that it converges faster than batch normalization and layer normalization in both convolutional and recurrent neural networks.
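A loose sketch of that idea, mixing a batch-normalized and a layer-normalized branch with a weight tied to the inverse mini-batch size, is shown below; the exact BLN weighting, affine parameters, and running statistics are not reproduced here.

```python
# Loose sketch of mixing batch and layer normalization, with the mixing
# weight tied to the inverse mini-batch size as the summary describes.
import torch
import torch.nn as nn

class BatchLayerNorm2d(nn.Module):
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1, 1))

    def forward(self, x):
        n = x.size(0)
        # Batch-normalized branch: statistics over (N, H, W) per channel.
        bn = (x - x.mean(dim=(0, 2, 3), keepdim=True)) / torch.sqrt(
            x.var(dim=(0, 2, 3), keepdim=True, unbiased=False) + self.eps)
        # Layer-normalized branch: statistics over (C, H, W) per sample.
        ln = (x - x.mean(dim=(1, 2, 3), keepdim=True)) / torch.sqrt(
            x.var(dim=(1, 2, 3), keepdim=True, unbiased=False) + self.eps)
        # Smaller batches put more weight on the per-sample branch
        # (illustrative weighting, not the paper's exact rule).
        w = 1.0 / n
        return self.gamma * ((1.0 - w) * bn + w * ln) + self.beta
```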
arXiv Detail & Related papers (2022-09-19T10:12:51Z)
- Network Pruning via Feature Shift Minimization [8.593369249204132]
We propose a novel Feature Shift Minimization (FSM) method to compress CNN models, which evaluates the feature shift by combining the information of both features and filters.
The proposed method yields state-of-the-art performance on various benchmark networks and datasets, verified by extensive experiments.
arXiv Detail & Related papers (2022-07-06T12:50:26Z)
- B-cos Networks: Alignment is All We Need for Interpretability [136.27303006772294]
We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training.
A sequence of B-cos transforms induces a single linear transform that faithfully summarises the full model computations.
We show that it can easily be integrated into common models such as VGGs, ResNets, InceptionNets, and DenseNets.
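The sketch below shows one way to read the B-cos idea for a single linear unit: the response of a unit-norm weight vector is rescaled by |cos(x, w)|^(B-1), so outputs are large only when the input aligns with the weight. The exact parameterization and network integration used in the paper are omitted.

```python
# Hedged sketch of a B-cos linear unit: w^T x scaled by |cos(x, w)|^(B-1),
# which suppresses outputs unless the input aligns with the unit-norm weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    def __init__(self, in_features, out_features, b=2.0, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.b = b
        self.eps = eps

    def forward(self, x):
        w_hat = F.normalize(self.weight, dim=1)   # unit-norm weight rows
        lin = F.linear(x, w_hat)                  # w_hat^T x
        cos = lin / (x.norm(dim=-1, keepdim=True) + self.eps)
        # |cos|^(B-1) rescaling promotes weight-input alignment.
        return lin * cos.abs().pow(self.b - 1.0)

y = BcosLinear(64, 10)(torch.randn(5, 64))
```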
arXiv Detail & Related papers (2022-05-20T16:03:29Z)
- Network Quantization with Element-wise Gradient Scaling [30.06895253269116]
Network quantization aims at reducing bit-widths of weights and/or activations.
Most methods use the straight-through estimator (STE) to train quantized networks.
We propose element-wise gradient scaling (EWGS), which trains quantized networks better than the STE.
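Below is a hedged sketch contrasting the STE with an element-wise gradient scaling rule written as 1 + delta * sign(g) * (x - x_q); treat this particular formula and the delta value as assumptions to be checked against the paper rather than its exact method.

```python
# Hedged sketch: STE would pass the gradient through the rounding step
# unchanged; here each element's gradient is rescaled using the gradient
# sign and the quantization error. The formula and delta are placeholders.
import torch

class RoundEWGS(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, delta):
        x_q = torch.round(x)              # uniform quantization step
        ctx.save_for_backward(x - x_q)    # keep the quantization error
        ctx.delta = delta
        return x_q

    @staticmethod
    def backward(ctx, grad_out):
        (err,) = ctx.saved_tensors
        scale = 1.0 + ctx.delta * torch.sign(grad_out) * err
        return grad_out * scale, None     # no gradient w.r.t. delta

x = torch.randn(4, requires_grad=True)
y = RoundEWGS.apply(x, 0.1)
y.sum().backward()
```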
arXiv Detail & Related papers (2021-04-02T06:34:53Z)
- BN-invariant sharpness regularizes the training model to better
generalization [72.97766238317081]
We propose a measure of sharpness, BN-Sharpness, which gives consistent value for equivalent networks under BN.
We use the BN-sharpness to regularize the training and design an algorithm to minimize the new regularized objective.
arXiv Detail & Related papers (2021-01-08T10:23:24Z)
- MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch
Normalization [60.36100335878855]
We propose a novel normalization method, named MimicNorm, to improve the convergence and efficiency in network training.
We leverage neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and drives the network into the chaotic regime, like a BN layer.
MimicNorm achieves similar accuracy for various network structures, including ResNets and lightweight networks like ShuffleNet, with a reduction of about 20% memory consumption.
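A rough sketch of the two ingredients described above, centering conv weights (the weight mean operation) and keeping a single BN layer at the end of the network, is given below; the architecture and layer sizes are illustrative only.

```python
# Rough sketch: (1) center conv weights per filter, (2) keep only one BN
# layer before the classifier instead of one BN per conv layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightCenteredConv2d(nn.Conv2d):
    def forward(self, x):
        # Subtract the per-filter mean from the weights; no output norm here.
        w = self.weight - self.weight.mean(dim=(1, 2, 3), keepdim=True)
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

net = nn.Sequential(
    WeightCenteredConv2d(3, 16, 3, padding=1), nn.ReLU(),
    WeightCenteredConv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.BatchNorm1d(32),          # the single "last BN layer"
    nn.Linear(32, 10),
)
out = net(torch.randn(8, 3, 32, 32))
```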
arXiv Detail & Related papers (2020-10-19T07:42:41Z)
- Double Forward Propagation for Memorized Batch Normalization [68.34268180871416]
Batch Normalization (BN) has been a standard component in designing deep neural networks (DNNs)
We propose a memorized batch normalization (MBN) which considers multiple recent batches to obtain more accurate and robust statistics.
Compared to related methods, the proposed MBN exhibits consistent behaviors in both training and inference.
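The sketch below illustrates the memorized-statistics idea by pooling the means and variances of several recent mini-batches and using the same pooled statistics in training and inference; the double forward propagation training procedure of the paper is not shown.

```python
# Loose sketch of normalizing with statistics pooled over several recent
# mini-batches instead of the current batch alone.
from collections import deque
import torch
import torch.nn as nn

class MemorizedBatchNorm2d(nn.Module):
    def __init__(self, num_features, memory=5, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.means = deque(maxlen=memory)   # recent per-batch means
        self.vars = deque(maxlen=memory)    # recent per-batch variances
        self.gamma = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_features, 1, 1))

    def forward(self, x):
        if self.training:
            self.means.append(x.mean(dim=(0, 2, 3), keepdim=True).detach())
            self.vars.append(x.var(dim=(0, 2, 3), keepdim=True, unbiased=False).detach())
        # Average the memorized statistics; the same rule is used in training
        # and inference, one way to get consistent behavior in both modes.
        mean = torch.stack(list(self.means)).mean(dim=0)
        var = torch.stack(list(self.vars)).mean(dim=0)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```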
arXiv Detail & Related papers (2020-10-10T08:48:41Z)
- PowerNorm: Rethinking Batch Normalization in Transformers [96.14956636022957]
The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN).
LN is preferred due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks.
We propose Power Normalization (PN), a novel normalization scheme that resolves this issue.
arXiv Detail & Related papers (2020-03-17T17:50:26Z)
- Embedding Propagation: Smoother Manifold for Few-Shot Classification [131.81692677836202]
We propose to use embedding propagation as an unsupervised non-parametric regularizer for manifold smoothing in few-shot classification.
We empirically show that embedding propagation yields a smoother embedding manifold.
We show that embedding propagation consistently improves the accuracy of the models in multiple semi-supervised learning scenarios by up to 16 percentage points.
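As one possible reading of embedding propagation, the sketch below smooths a set of embeddings with a label-propagation-style operator over an RBF similarity graph; the similarity kernel, alpha, and propagator (I - alpha*A)^-1 are generic choices and may differ from the paper's exact operator.

```python
# Hedged sketch: propagate embeddings over a similarity graph so that
# nearby points are pulled toward each other (manifold smoothing).
import torch

def propagate_embeddings(z, alpha=0.5, sigma=1.0):
    # Pairwise RBF similarities between embeddings, zeroed on the diagonal.
    d2 = torch.cdist(z, z).pow(2)
    a = torch.exp(-d2 / (2 * sigma ** 2))
    a.fill_diagonal_(0.0)
    # Symmetrically normalize the adjacency matrix: D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = a.sum(dim=1).clamp_min(1e-12).rsqrt()
    a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    # Closed-form propagation, as in classical label propagation.
    n = z.size(0)
    prop = torch.linalg.inv(torch.eye(n) - alpha * a_norm)
    return prop @ z

smoothed = propagate_embeddings(torch.randn(20, 64))
```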
arXiv Detail & Related papers (2020-03-09T13:51:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.