Understanding the Failure of Batch Normalization for Transformers in NLP
- URL: http://arxiv.org/abs/2210.05153v1
- Date: Tue, 11 Oct 2022 05:18:47 GMT
- Title: Understanding the Failure of Batch Normalization for Transformers in NLP
- Authors: Jiaxi Wang, Ji Wu, Lei Huang
- Abstract summary: Batch Normalization (BN) is a technique to accelerate the training of deep neural networks.
BN fails to defend its position in Natural Language Processing (NLP), which is dominated by Layer Normalization (LN).
Regularized BN (RBN) improves the performance of BN consistently and outperforms or is on par with LN on 17 out of 20 settings.
- Score: 16.476194435004732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Batch Normalization (BN) is a core and prevalent technique in accelerating
the training of deep neural networks and improving the generalization on
Computer Vision (CV) tasks. However, it fails to defend its position in Natural
Language Processing (NLP), which is dominated by Layer Normalization (LN). In
this paper, we are trying to answer why BN usually performs worse than LN in
NLP tasks with Transformer models. We find that the inconsistency between
training and inference of BN is the leading cause of its failure in NLP. We
define Training Inference Discrepancy (TID) to quantitatively
measure this inconsistency and reveal that TID can indicate BN's performance,
supported by extensive experiments, including image classification, neural
machine translation, language modeling, sequence labeling, and text
classification tasks. We find that BN can obtain much better test performance
than LN when TID remains small throughout training. To suppress the explosion of
TID, we propose Regularized BN (RBN) that adds a simple regularization term to
narrow the gap between batch statistics and population statistics of BN. RBN
improves the performance of BN consistently and outperforms or is on par with
LN on 17 out of 20 settings, involving ten datasets and two common variants of
Transformer. Our code is available at https://github.com/wjxts/RegularizedBN.
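The two ideas described above can be illustrated with a short sketch: a rough measure of the gap between a BN layer's batch statistics and its running (population) statistics, and an RBN-style penalty that narrows that gap during training. The snippet below is a minimal, hypothetical PyTorch illustration, not the authors' released code; the names stat_discrepancy, RegularizedBNBlock, and reg_weight are invented here, and the squared-distance penalty is a simplified stand-in for the paper's exact definitions of TID and the RBN regularizer (see the repository above for the actual implementation).

```python
# Minimal sketch (assumes PyTorch); names below are illustrative, not from the paper's code.
import torch
import torch.nn as nn


def stat_discrepancy(bn: nn.BatchNorm1d, h: torch.Tensor) -> torch.Tensor:
    """Proxy for the training/inference gap of one BN layer: squared distance
    between the current batch statistics and the running (population) statistics
    used at inference time. The paper's TID is defined more carefully; this is
    only an illustrative stand-in."""
    batch_mean = h.mean(dim=0)
    batch_var = h.var(dim=0, unbiased=False)
    mean_gap = (batch_mean - bn.running_mean).pow(2).mean()
    var_gap = (batch_var - bn.running_var).pow(2).mean()
    return mean_gap + var_gap


class RegularizedBNBlock(nn.Module):
    """Linear layer followed by BN; the forward pass also returns an RBN-style
    penalty that can be added to the task loss to keep batch and population
    statistics close."""

    def __init__(self, in_dim: int, out_dim: int, reg_weight: float = 0.1):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.reg_weight = reg_weight

    def forward(self, x: torch.Tensor):
        h = self.linear(x)
        penalty = h.new_zeros(())
        if self.training:
            penalty = self.reg_weight * stat_discrepancy(self.bn, h)
        return self.bn(h), penalty


if __name__ == "__main__":
    torch.manual_seed(0)
    block = RegularizedBNBlock(in_dim=8, out_dim=16)
    x = torch.randn(32, 8)
    y, penalty = block(x)
    task_loss = y.pow(2).mean()        # placeholder task loss
    (task_loss + penalty).backward()   # penalty pushes batch stats toward running stats
    print(f"task_loss={task_loss.item():.4f}  penalty={penalty.item():.4f}")
```

In a full Transformer one would sum such penalties over all BN layers and add the total to the training loss; a single block and a placeholder loss keep this sketch self-contained.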
Related papers
- Overcoming Recency Bias of Normalization Statistics in Continual
Learning: Balance and Adaptation [67.77048565738728]
Continual learning involves learning a sequence of tasks and balancing their knowledge appropriately.
We propose Adaptive Balance of BN (AdaB²N), which appropriately incorporates a Bayesian-based strategy to adapt task-wise contributions.
Our approach achieves significant performance gains across a wide range of benchmarks.
arXiv Detail & Related papers (2023-10-13T04:50:40Z) - Diagnosing Batch Normalization in Class Incremental Learning [39.70552266952221]
Batch normalization (BN) standardizes intermediate feature maps and has been widely validated to improve training stability and convergence.
We propose BN Tricks to address the issue by training a better feature extractor while eliminating classification bias.
We show that BN Tricks can bring significant performance gains to all adopted baselines.
arXiv Detail & Related papers (2022-02-16T12:38:43Z) - Rebalancing Batch Normalization for Exemplar-based Class-Incremental
Learning [23.621259845287824]
Batch Normalization (BN) has been extensively studied for neural nets in various computer vision tasks.
We develop a new update patch for BN, particularly tailored for exemplar-based class-incremental learning (CIL).
arXiv Detail & Related papers (2022-01-29T11:03:03Z) - Batch Normalization Preconditioning for Neural Network Training [7.709342743709842]
Batch normalization (BN) is a popular and ubiquitous method in deep learning.
BN is not suitable for use with very small mini-batch sizes or online learning.
We propose a new method called Batch Normalization Preconditioning (BNP).
arXiv Detail & Related papers (2021-08-02T18:17:26Z) - "BNN - BN = ?": Training Binary Neural Networks without Batch
Normalization [92.23297927690149]
Batch normalization (BN) is a key facilitator and considered essential for state-of-the-art binary neural networks (BNN).
We extend their framework to training BNNs, and for the first time demonstrate that BN can be completely removed from BNN training and inference regimes.
arXiv Detail & Related papers (2021-04-16T16:46:57Z) - MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch
Normalization [60.36100335878855]
We propose a novel normalization method, named MimicNorm, to improve the convergence and efficiency in network training.
We leverage neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and transitions the network into the chaotic regime, as a BN layer does.
MimicNorm achieves similar accuracy for various network structures, including ResNets and lightweight networks like ShuffleNet, with a reduction of about 20% in memory consumption.
arXiv Detail & Related papers (2020-10-19T07:42:41Z) - Double Forward Propagation for Memorized Batch Normalization [68.34268180871416]
Batch Normalization (BN) has been a standard component in designing deep neural networks (DNNs).
We propose a memorized batch normalization (MBN) which considers multiple recent batches to obtain more accurate and robust statistics.
Compared to related methods, the proposed MBN exhibits consistent behaviors in both training and inference.
arXiv Detail & Related papers (2020-10-10T08:48:41Z) - PowerNorm: Rethinking Batch Normalization in Transformers [96.14956636022957]
The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN).
LN is preferred due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks.
We propose Power Normalization (PN), a novel normalization scheme that resolves this issue.
arXiv Detail & Related papers (2020-03-17T17:50:26Z) - Towards Stabilizing Batch Statistics in Backward Propagation of Batch
Normalization [126.6252371899064]
Moving Average Batch Normalization (MABN) is a novel normalization method.
We show that MABN can completely restore the performance of vanilla BN in small batch cases.
Our experiments demonstrate the effectiveness of MABN in multiple computer vision tasks including ImageNet and COCO.
arXiv Detail & Related papers (2020-01-19T14:41:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.