FusionFormer: Fusing Operations in Transformer for Efficient Streaming
Speech Recognition
- URL: http://arxiv.org/abs/2210.17079v1
- Date: Mon, 31 Oct 2022 06:01:02 GMT
- Title: FusionFormer: Fusing Operations in Transformer for Efficient Streaming
Speech Recognition
- Authors: Xingchen Song, Di Wu, Binbin Zhang, Zhiyong Wu, Wenpeng Li, Dongfang
Li, Pengshen Zhang, Zhendong Peng, Fuping Pan, Changbao Zhu, Zhongqin Wu
- Abstract summary: Inherited from Natural Language Processing (NLP) tasks, the architecture takes Layer Normalization (LN) as its default normalization technique.
LN might take 10% of the inference time even though it contributes only 0.1% of the FLOPs.
We propose to append a BN layer to each linear or convolution layer, with which stable training is observed.
- Score: 15.408221924741298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed Conformer architecture, which combines convolution with
attention to capture both local and global dependencies, has become the
de facto backbone model for Automatic Speech Recognition (ASR).
Inherited from Natural Language Processing (NLP) tasks, the architecture
takes Layer Normalization (LN) as its default normalization technique. However,
through a series of systematic studies, we find that LN might take 10% of the
inference time even though it contributes only 0.1% of the FLOPs. This
motivates us to replace LN with other normalization techniques, e.g., Batch
Normalization (BN), to speed up inference with the help of operator fusion
methods and by avoiding the calculation of mean and variance statistics
during inference. After examining several straightforward attempts that directly remove
all LN layers or replace them with BN in the same place, we find that the
divergence issue is mainly caused by unstable layer outputs. We therefore
propose to append a BN layer to each linear or convolution layer, with which
stable training is observed. We also propose to simplify the
activations in Conformer, such as Swish and GLU, by replacing them with ReLU.
All these exchanged modules can be fused into the weights of the adjacent
linear/convolution layers and hence have zero inference cost. We therefore
name the model FusionFormer. Our experiments indicate that FusionFormer is as
effective as the LN-based Conformer and is about 10% faster.
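To make the fusion concrete, folding a BatchNorm that follows a linear (or convolution) layer into that layer's weights is a standard trick, sketched below in PyTorch for the linear case. This is a generic illustration under eval-mode BN semantics, not the authors' released implementation; the helper name fuse_linear_bn and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_linear_bn(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    # Fold y = BN(W x + b) into a single Linear layer for inference.
    # With BN(z) = gamma * (z - mean) / sqrt(var + eps) + beta, the fused
    # parameters are W' = diag(s) W and b' = s * (b - mean) + beta,
    # where s = gamma / sqrt(var + eps) uses the frozen running statistics.
    s = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Linear(linear.in_features, linear.out_features, bias=True)
    fused.weight.copy_(linear.weight * s.unsqueeze(1))
    b = linear.bias if linear.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(s * (b - bn.running_mean) + bn.bias)
    return fused

# Sanity check: Linear -> BN (eval mode) matches the single fused Linear.
linear, bn = nn.Linear(80, 256), nn.BatchNorm1d(256)
bn.train()
for _ in range(10):                      # give BN non-trivial running stats
    bn(linear(torch.randn(32, 80)))
bn.eval()
x = torch.randn(4, 80)
assert torch.allclose(bn(linear(x)), fuse_linear_bn(linear, bn)(x), atol=1e-5)
```

The same algebra applies per output channel of a convolution; because BN's statistics are frozen at inference, no mean or variance ever needs to be computed on the fly.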
Related papers
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
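For intuition, the kind of update being analyzed, interpolating between the current parameters and the result of a base optimizer step (theta <- theta + lam * (base_step(theta) - theta)), can be sketched as below. This is a generic lookahead-style illustration with an assumed interpolation weight lam, not the exact scheme from that paper.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def interpolated_step(model, loss_fn, opt, lam=0.5):
    # One update of theta <- theta + lam * (base_step(theta) - theta).
    anchor = copy.deepcopy(model.state_dict())   # theta_t
    opt.zero_grad()
    loss_fn(model).backward()
    opt.step()                                   # parameters now hold base_step(theta_t)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.lerp_(anchor[name], 1.0 - lam)     # pull back toward theta_t by (1 - lam)

# Toy usage with a hypothetical regression loss.
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
interpolated_step(model, lambda m: F.mse_loss(m(x), y), opt, lam=0.5)
```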
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - A Framework for Provably Stable and Consistent Training of Deep
Feedforward Networks [4.21061712600981]
We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios.
This algorithm combines standard gradient descent with the gradient clipping method.
We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
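Gradient clipping itself is a standard ingredient; a minimal PyTorch sketch of one descent step with global-norm clipping is shown below (the threshold max_norm=1.0 and the toy model are assumptions, and this is not that paper's specific algorithm).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

opt.zero_grad()
F.mse_loss(model(x), y).backward()
# Rescale gradients so their global L2 norm is at most max_norm before stepping.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```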
arXiv Detail & Related papers (2023-05-20T07:18:06Z) - Rethinking Normalization Methods in Federated Learning [92.25845185724424]
Federated learning (FL) is a popular distributed learning framework that can reduce privacy risks by not explicitly sharing private data.
We show that external covariate shifts will lead to the obliteration of some devices' contributions to the global model.
arXiv Detail & Related papers (2022-10-07T01:32:24Z) - Unified Normalization for Accelerating and Stabilizing Transformers [35.07454490355906]
Layer Normalization (LN) normalizes activations within each token to boost robustness.
LN requires on-the-fly statistics calculation at inference time, as well as division and square root operations.
We propose Unified Normalization (UN), which can speed up the inference by being fused with other linear operations.
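This is the same bottleneck FusionFormer targets: LN has to recompute per-token mean and variance (plus a division and a square root) on every inference call, while BN in eval mode reduces to a fixed per-feature affine transform that can be folded into neighboring layers. A minimal PyTorch contrast under assumed tensor shapes (this is not UN itself):

```python
import torch
import torch.nn as nn

d = 256
x = torch.randn(8, 100, d)                      # (batch, time, feature)

# LayerNorm: statistics must be recomputed from x on every inference call.
ln = nn.LayerNorm(d).eval()
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
assert torch.allclose(ln(x), (x - mu) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias, atol=1e-5)

# BatchNorm in eval mode: statistics are frozen buffers, so the whole op is an
# elementwise affine transform that can be folded into a neighboring layer.
bn = nn.BatchNorm1d(d).eval()
scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
shift = bn.bias - bn.running_mean * scale
assert torch.allclose(bn(x.transpose(1, 2)).transpose(1, 2), x * scale + shift, atol=1e-5)
```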
arXiv Detail & Related papers (2022-08-02T08:41:31Z) - Distribution Mismatch Correction for Improved Robustness in Deep Neural
Networks [86.42889611784855]
Normalization methods increase the vulnerability with respect to noise and input corruptions.
We propose an unsupervised non-parametric distribution correction method that adapts the activation distribution of each layer.
In our experiments, we empirically show that the proposed method effectively reduces the impact of intense image corruptions.
arXiv Detail & Related papers (2021-10-05T11:36:25Z) - Orthogonal Jacobian Regularization for Unsupervised Disentanglement in
Image Generation [64.92152574895111]
We propose a simple Orthogonal Jacobian Regularization (OroJaR) to encourage deep generative models to learn disentangled representations.
Our method is effective in disentangled and controllable image generation, and performs favorably against the state-of-the-art methods.
arXiv Detail & Related papers (2021-08-17T15:01:46Z) - Delving into Variance Transmission and Normalization: Shift of Average
Gradient Makes the Network Collapse [9.848051975417116]
We explain the effect of Batch Normalization (BN) from the perspective of variance transmission.
We propose Parametric Weights Standardization (PWS) to solve the shift of the average gradient.
PWS enables the network to converge fast without normalizing the outputs.
arXiv Detail & Related papers (2021-03-22T05:40:46Z) - MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch
Normalization [60.36100335878855]
We propose a novel normalization method, named MimicNorm, to improve the convergence and efficiency in network training.
We leverage neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and transitions the network into the chaotic regime like a BN layer.
MimicNorm achieves similar accuracy for various network structures, including ResNets and lightweight networks like ShuffleNet, with a reduction of about 20% memory consumption.
arXiv Detail & Related papers (2020-10-19T07:42:41Z) - PowerNorm: Rethinking Batch Normalization in Transformers [96.14956636022957]
The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN).
LN is preferred due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks.
We propose Power Normalization (PN), a novel normalization scheme that resolves this issue.
arXiv Detail & Related papers (2020-03-17T17:50:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.