Related papers: Stochastic Attention Head Removal: A simple and effective method for improving Transformer Based ASR Models

Stochastic Attention Head Removal: A simple and effective method for improving Transformer Based ASR Models

URL: http://arxiv.org/abs/2011.04004v2
Date: Tue, 6 Apr 2021 15:29:51 GMT
Title: Stochastic Attention Head Removal: A simple and effective method for improving Transformer Based ASR Models
Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals
Abstract summary: We propose to randomly remove attention heads during training and keep all attention heads at test time, thus the final model is an ensemble of models with different architectures. Our method gives consistent performance gains over strong baselines on the Wall Street Journal, AISHELL, Switchboard and AMI datasets.
Score: 40.991809705930955
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Transformer based models have shown competitive automatic speech recognition (ASR) performance. One key factor in the success of these models is the multi-head attention mechanism. However, for trained models, we have previously observed that many attention matrices are close to diagonal, indicating the redundancy of the corresponding attention heads. We have also found that some architectures with reduced numbers of attention heads have better performance. Since the search for the best structure is time prohibitive, we propose to randomly remove attention heads during training and keep all attention heads at test time, thus the final model is an ensemble of models with different architectures. The proposed method also forces each head independently learn the most useful patterns. We apply the proposed method to train Transformer based and Convolution-augmented Transformer (Conformer) based ASR models. Our method gives consistent performance gains over strong baselines on the Wall Street Journal, AISHELL, Switchboard and AMI datasets. To the best of our knowledge, we have achieved state-of-the-art end-to-end Transformer based model performance on Switchboard and AMI.

Related papers

State Tuning: State-based Test-Time Scaling on RWKV-7 [0.747193191854175]
We introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights.
arXiv Detail & Related papers (2025-04-07T14:04:30Z)
Learning Transformer-based World Models with Contrastive Predictive Coding [58.0159270859475]
We show that the next state prediction objective is insufficient to fully exploit the representation capabilities of Transformers. We propose to extend world model predictions to longer time horizons by introducing TWISTER, a world model using action-conditioned Contrastive Predictive Coding. TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.
arXiv Detail & Related papers (2025-03-06T13:18:37Z)
ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition [5.311735227179715]
We explore and devise a novel ConvMixFormer architecture for dynamic hand gestures. The proposed method is evaluated on NVidia Dynamic Hand Gesture and Briareo datasets. Our model has achieved state-of-the-art results on single and multimodal inputs.
arXiv Detail & Related papers (2024-11-11T16:45:18Z)
Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches. This is the first time that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data. Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
Decision Transformer: Reinforcement Learning via Sequence Modeling [102.86873656751489]
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. We present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
arXiv Detail & Related papers (2021-06-02T17:53:39Z)
Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models. We prove that eliminating the MASK token and considering the whole output during the loss are essential choices to improve performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
TSNAT: Two-Step Non-Autoregressvie Transformer Models for Speech Recognition [69.68154370877615]
The non-autoregressive (NAR) models can get rid of the temporal dependency between the output tokens and predict the entire output tokens in at least one step. To address these two problems, we propose a new model named the two-step non-autoregressive transformer(TSNAT) The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
Scaling Local Self-Attention For Parameter Efficient Visual Backbones [29.396052798583234]
Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions. We develop a new self-attention model family, emphHaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark.
arXiv Detail & Related papers (2021-03-23T17:56:06Z)
Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model with the time reduction layer, in which we incorporate time reduction layer inside transformer encoder layers. We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model. With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
arXiv Detail & Related papers (2021-03-17T21:02:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.