Unified Normalization for Accelerating and Stabilizing Transformers
- URL: http://arxiv.org/abs/2208.01313v1
- Date: Tue, 2 Aug 2022 08:41:31 GMT
- Title: Unified Normalization for Accelerating and Stabilizing Transformers
- Authors: Qiming Yang, Kai Zhang, Chaoxiang Lan, Zhi Yang, Zheyang Li, Wenming
Tan, Jun Xiao, Shiliang Pu
- Abstract summary: Layer Normalization (LN) normalizes activations within each token to boost robustness.
LN requires on-the-fly statistics calculation in inference as well as division and square root operations.
We propose Unified Normalization (UN), which can speed up the inference by being fused with other linear operations.
- Score: 35.07454490355906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Solid results from Transformers have made them prevailing architectures in
various natural language and vision tasks. As a default component in
Transformers, Layer Normalization (LN) normalizes activations within each token
to boost the robustness. However, LN requires on-the-fly statistics calculation
in inference as well as division and square root operations, leading to
inefficiency on hardware. What is more, replacing LN with other
hardware-efficient normalization schemes (e.g., Batch Normalization) results in
inferior performance, even collapse in training. We find that this dilemma is
caused by abnormal behaviors of activation statistics, including large
fluctuations over iterations and extreme outliers across layers. To tackle
these issues, we propose Unified Normalization (UN), which speeds up inference
by being fused with other linear operations and achieves performance on par
with LN. UN strives to boost performance by calibrating the activation and
gradient statistics with a tailored fluctuation smoothing strategy. Meanwhile,
an adaptive outlier filtration strategy is applied to avoid collapse in
training; its effectiveness is theoretically proven and experimentally verified
in this paper. We demonstrate that UN can be an
efficient drop-in alternative to LN by conducting extensive experiments on
language and vision tasks. Besides, we evaluate the efficiency of our method on
GPU. Transformers equipped with UN enjoy about 31% inference speedup and nearly
18% memory reduction. Code will be released at
https://github.com/hikvision-research/Unified-Normalization.
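The abstract's claim that UN can be "fused with other linear operations" rests on a general property: once a normalization layer uses statistics that are fixed at inference time (unlike LN, which recomputes mean and variance per token), it reduces to a per-feature affine transform and can be folded into an adjacent linear layer, removing the division and square root from the inference graph. The NumPy sketch below illustrates this folding under that assumption; the names (mu, var, gamma, beta, W, b) and the epsilon value are illustrative and not taken from the paper's released code.

```python
import numpy as np

# Illustrative sketch (not the paper's code): with fixed statistics (mu, var)
# and affine parameters (gamma, beta), a normalization layer computes
#     norm(x) = gamma * (x - mu) / sqrt(var + eps) + beta,
# which is affine in x and can therefore be folded into a following linear layer.

def fold_norm_into_linear(W, b, mu, var, gamma, beta, eps=1e-5):
    """Fold y = W @ norm(x) + b into a single linear map y = W_f @ x + b_f."""
    scale = gamma / np.sqrt(var + eps)   # per-feature scale of the normalization
    W_f = W * scale                      # absorb the scale into each column of W
    b_f = b + W @ (beta - mu * scale)    # absorb the shift into the bias
    return W_f, b_f

# Quick check that the folded layer matches "normalize, then linear".
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.normal(size=d_in)
W, b = rng.normal(size=(d_out, d_in)), rng.normal(size=d_out)
mu, gamma, beta = rng.normal(size=d_in), rng.normal(size=d_in), rng.normal(size=d_in)
var = rng.uniform(0.5, 2.0, size=d_in)

reference = W @ (gamma * (x - mu) / np.sqrt(var + 1e-5) + beta) + b
W_f, b_f = fold_norm_into_linear(W, b, mu, var, gamma, beta)
assert np.allclose(reference, W_f @ x + b_f)
print("folded linear layer reproduces normalization + linear")
```

This is the same algebra commonly used to fold Batch Normalization into convolutions at inference time; per the abstract, UN's contribution is making such statistics usable in Transformers without training collapse, via fluctuation smoothing and adaptive outlier filtration.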
Related papers
- Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification [53.727688136434345]
Graph Neural Networks (GNNs) have shown superior performance in node classification.
We present Fast Graph Sharpness-Aware Minimization (FGSAM) that integrates the rapid training of Multi-Layer Perceptrons with the superior performance of GNNs.
Our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks.
arXiv Detail & Related papers (2024-10-22T09:33:29Z)
- Does learning the right latent variables necessarily improve in-context learning? [13.828665019247444]
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights.
In this paper, we investigate the effect of explicitly inferring task latents.
We find little discernible difference between the two approaches; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance.
arXiv Detail & Related papers (2024-05-29T15:06:10Z)
- On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability [34.43255978863601]
Several works suggest that transformers learn a mesa-optimizer during autoregressive training.
We show that a stronger assumption related to the moments of data is the sufficient and necessary condition that the learned mesa-optimizer can perform.
arXiv Detail & Related papers (2024-05-27T05:41:06Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition [15.408221924741298]
Inherited from Natural Language Processing (NLP) tasks, the architecture takes Layer Normalization (LN) as the default normalization technique.
LN might take 10% of the inference time even though it contributes only 0.1% of the FLOPs.
We propose appending a BN layer to each linear or convolution layer, with which stabilized training results are observed.
arXiv Detail & Related papers (2022-10-31T06:01:02Z)
- Rethinking Normalization Methods in Federated Learning [92.25845185724424]
Federated learning (FL) is a popular distributed learning framework that can reduce privacy risks by not explicitly sharing private data.
We show that external covariate shifts will lead to the obliteration of some devices' contributions to the global model.
arXiv Detail & Related papers (2022-10-07T01:32:24Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize the early stage's training and unleash its full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
- PowerNorm: Rethinking Batch Normalization in Transformers [96.14956636022957]
The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN).
LN is preferred due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks.
We propose Power Normalization (PN), a novel normalization scheme that resolves this issue.
arXiv Detail & Related papers (2020-03-17T17:50:26Z)