Stochastic Attention Head Removal: A simple and effective method for
improving Transformer Based ASR Models
- URL: http://arxiv.org/abs/2011.04004v2
- Date: Tue, 6 Apr 2021 15:29:51 GMT
- Title: Stochastic Attention Head Removal: A simple and effective method for
improving Transformer Based ASR Models
- Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals
- Abstract summary: We propose to randomly remove attention heads during training and keep all attention heads at test time, so that the final model is an ensemble of models with different architectures.
Our method gives consistent performance gains over strong baselines on the Wall Street Journal, AISHELL, Switchboard and AMI datasets.
- Score: 40.991809705930955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer based models have shown competitive automatic speech
recognition (ASR) performance. One key factor in the success of these models is
the multi-head attention mechanism. However, for trained models, we have
previously observed that many attention matrices are close to diagonal,
indicating the redundancy of the corresponding attention heads. We have also
found that some architectures with reduced numbers of attention heads have
better performance. Since the search for the best structure is prohibitively
time-consuming, we propose to randomly remove attention heads during training and
keep all attention heads at test time, so that the final model is an ensemble of
models with different architectures. The proposed method also forces each head
to learn the most useful patterns independently. We apply the proposed method to
train Transformer based and Convolution-augmented Transformer (Conformer) based
ASR models. Our method gives consistent performance gains over strong baselines
on the Wall Street Journal, AISHELL, Switchboard and AMI datasets. To the best
of our knowledge, we have achieved state-of-the-art end-to-end Transformer
based model performance on Switchboard and AMI.
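To make the training scheme concrete, below is a minimal PyTorch-style sketch of head-level stochastic removal. It is not the authors' implementation: the module name, the removal probability p_remove, the per-utterance mask sampling, and the inverted-dropout rescaling are all assumptions made for illustration.

```python
# Sketch of stochastic attention head removal (an illustration, not the paper's code).
# During training, each head's output is zeroed with probability p_remove;
# at test time (model.eval()) all heads are kept.
import torch
import torch.nn as nn


class StochasticHeadRemovalMHA(nn.Module):
    """Multi-head self-attention whose heads are randomly removed during training."""

    def __init__(self, d_model: int, n_heads: int, p_remove: float = 0.2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.p_remove = p_remove
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head).
        q, k, v = [z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v)]
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = att @ v  # (batch, heads, time, d_head)

        if self.training and self.p_remove > 0:
            # Sample a Bernoulli keep-mask per utterance and per head,
            # then zero out the removed heads' outputs.
            keep = (torch.rand(b, self.n_heads, 1, 1, device=x.device)
                    > self.p_remove).float()
            # Rescale so the expected output matches test time, when all heads
            # are kept (inverted-dropout convention; an assumption).
            heads = heads * keep / (1.0 - self.p_remove)

        heads = heads.transpose(1, 2).contiguous().view(b, t, -1)
        return self.out(heads)
```

Because no heads are removed in evaluation mode, the test-time network averages over the sub-architectures seen during training, which is the ensemble interpretation described in the abstract.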
Related papers
- StochCA: A Novel Approach for Exploiting Pretrained Models with Cross-Attention [2.66269503676104]
We introduce a novel fine-tuning method, called stochastic cross-attention (StochCA), specific to Transformer architectures.
This method modifies the Transformer's self-attention mechanism to selectively utilize knowledge from pretrained models during fine-tuning.
Our experimental results show the superiority of StochCA over state-of-the-art approaches in both areas.
arXiv Detail & Related papers (2024-02-25T13:53:49Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches; this is the first time such a model has done so.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while using far fewer trainable parameters, and achieves high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - Decision Transformer: Reinforcement Learning via Sequence Modeling [102.86873656751489]
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
We present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
arXiv Detail & Related papers (2021-06-02T17:53:39Z) - Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformer-based models.
We prove that eliminating the MASK token and considering the whole output during the loss computation are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z) - TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech
Recognition [69.68154370877615]
The non-autoregressive (NAR) models can get rid of the temporal dependency between the output tokens and predict the entire output tokens in at least one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z) - Scaling Local Self-Attention For Parameter Efficient Visual Backbones [29.396052798583234]
Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions.
We develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark.
arXiv Detail & Related papers (2021-03-23T17:56:06Z) - Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning
with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model that incorporates a time-reduction layer inside the Transformer encoder layers (a minimal sketch of the general idea appears after this list).
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
arXiv Detail & Related papers (2021-03-17T21:02:36Z)
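As referenced in the last entry above, the following is a minimal sketch of the general time-reduction idea only, not that paper's design: consecutive encoder frames are concatenated along the feature axis and projected back to the model dimension, shrinking the sequence length by an assumed factor reduction (the class and parameter names are hypothetical).

```python
# Sketch of a time-reduction layer for a Transformer encoder (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeReductionLayer(nn.Module):
    def __init__(self, d_model: int, reduction: int = 2):
        super().__init__()
        self.reduction = reduction
        self.proj = nn.Linear(reduction * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Pad the time axis so its length is divisible by the reduction factor.
        pad = (-t) % self.reduction
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
        # (B, T, D) -> (B, T // reduction, reduction * D) -> (B, T // reduction, D)
        x = x.reshape(b, (t + pad) // self.reduction, self.reduction * d)
        return self.proj(x)
```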
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.