Analysis of Self-Attention Head Diversity for Conformer-based Automatic
Speech Recognition
- URL: http://arxiv.org/abs/2209.06096v1
- Date: Tue, 13 Sep 2022 15:50:03 GMT
- Title: Analysis of Self-Attention Head Diversity for Conformer-based Automatic
Speech Recognition
- Authors: Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno
- Abstract summary: We investigate approaches to increasing attention head diversity.
We show that introducing diversity-promoting auxiliary loss functions during training is a more effective approach.
Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.
- Score: 36.53453860656191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention layers are an integral part of modern end-to-end automatic speech
recognition systems, for instance as part of the Transformer or Conformer
architecture. Attention is typically multi-headed, where each head has an
independent set of learned parameters and operates on the same input feature
sequence. The output of multi-headed attention is a fusion of the outputs from
the individual heads. We empirically analyze the diversity between
representations produced by the different attention heads and demonstrate that
the heads become highly correlated during the course of training. We
investigate a few approaches to increasing attention head diversity, including
using different attention mechanisms for each head and auxiliary training loss
functions to promote head diversity. We show that introducing
diversity-promoting auxiliary loss functions during training is a more
effective approach, and obtain WER improvements of up to 6% relative on the
Librispeech corpus. Finally, we draw a connection between the diversity of
attention heads and the similarity of the gradients of head parameters.
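To make the idea concrete, here is a minimal PyTorch sketch of one plausible diversity-promoting auxiliary loss: it penalizes the mean pairwise cosine similarity between per-head output representations and is added to the main ASR criterion with a tunable weight. The abstract does not specify the exact loss used in the paper, so the function name, tensor shapes, and weight below are illustrative assumptions.

```python
import torch

def head_diversity_penalty(head_outputs: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between attention-head outputs.

    head_outputs: (num_heads, batch, time, dim), the stacked per-head
    representations before the usual output fusion/projection.
    """
    h = head_outputs.shape[0]
    flat = head_outputs.reshape(h, -1)                     # one vector per head
    flat = torch.nn.functional.normalize(flat, dim=-1)
    sim = flat @ flat.T                                    # (h, h) cosine similarities
    off_diag = sim - torch.eye(h, device=sim.device)       # drop self-similarity
    return off_diag.sum() / (h * (h - 1))                  # average over head pairs

# Hypothetical training step: total loss = ASR criterion + lambda * penalty.
lam = 0.1                                                  # illustrative weight
head_outputs = torch.randn(8, 4, 50, 64, requires_grad=True)  # dummy activations
asr_loss = torch.tensor(0.0)                               # stand-in for a CTC/RNN-T loss
total_loss = asr_loss + lam * head_diversity_penalty(head_outputs)
total_loss.backward()
```

Minimizing the total loss pushes head representations apart, which is the effect the paper's diversity-promoting auxiliary losses are designed to achieve.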
Related papers
- Picking the Underused Heads: A Network Pruning Perspective of Attention
Head Selection for Fusing Dialogue Coreference Information [50.41829484199252]
Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing.
We investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective.
arXiv Detail & Related papers (2023-12-15T05:27:24Z)
- Finding the Pillars of Strength for Multi-Head Attention [35.556186723898485]
Recent studies have revealed several issues with Multi-Head Attention (MHA).
We propose Grouped Head Attention, trained with a self-supervised group constraint that groups attention heads.
We additionally propose a Voting-to-Stay procedure to remove redundant heads, yielding a lighter-weight transformer.
arXiv Detail & Related papers (2023-05-22T03:44:44Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any models with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
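The summary above does not say how the key and query distributions are matched; as a rough, hedged stand-in, one can penalize the per-head first- and second-moment mismatch between query and key vectors (the cited paper uses a more principled distribution-matching objective). All shapes and names below are illustrative.

```python
import torch

def kq_alignment_penalty(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Per-head moment mismatch between query and key vectors.

    q, k: (batch, num_heads, time, head_dim). Matching only the first two
    moments is an illustrative stand-in for the paper's objective.
    """
    mean_gap = (q.mean(dim=(0, 2)) - k.mean(dim=(0, 2))).pow(2).mean()
    std_gap = (q.std(dim=(0, 2)) - k.std(dim=(0, 2))).pow(2).mean()
    return mean_gap + std_gap

q, k = torch.randn(4, 8, 50, 64), torch.randn(4, 8, 50, 64)
aux = kq_alignment_penalty(q, k)   # added to the task loss with a small weight
```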
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Multiple parallel attention mechanisms, realized as multiple attention heads, improve the performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss function that prevents the model from assigning the same score to all heads, which helps identify the more important heads and improves performance.
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
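A hedged sketch of the dynamic head-importance idea summarized above: per-head scores are computed from the input, used to reweight head outputs, and an auxiliary term discourages uniform scores. The module name, pooling choice, and variance-based penalty are illustrative assumptions, not DHICM's exact design.

```python
import torch
import torch.nn as nn

class DynamicHeadWeighting(nn.Module):
    """Input-conditioned per-head importance scores (illustrative sketch)."""

    def __init__(self, num_heads: int, model_dim: int):
        super().__init__()
        self.score = nn.Linear(model_dim, num_heads)    # scores from pooled input

    def forward(self, x: torch.Tensor, head_outputs: torch.Tensor):
        # x: (batch, time, model_dim); head_outputs: (batch, heads, time, head_dim)
        w = torch.softmax(self.score(x.mean(dim=1)), dim=-1)   # (batch, heads)
        weighted = head_outputs * w[:, :, None, None]          # reweight each head
        # Auxiliary term: reward variance across head scores so the model
        # cannot assign the same score to every head.
        uniformity_loss = -w.var(dim=-1).mean()
        return weighted, uniformity_loss

mod = DynamicHeadWeighting(num_heads=8, model_dim=256)
out, aux = mod(torch.randn(4, 50, 256), torch.randn(4, 8, 50, 32))
```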
- The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT [18.13834903235249]
Multi-headed attention is a mainstay of transformer-based models.
Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention.
We formalize a simple yet effective score that generalizes to all the roles of attention heads and employs hypothesis testing on this score for robust inference.
arXiv Detail & Related papers (2021-01-22T14:10:59Z)
- Multi-Head Self-Attention with Role-Guided Masks [20.955992710112216]
We propose a method to guide the attention heads towards roles identified in prior work as important.
We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input.
Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
arXiv Detail & Related papers (2020-12-22T21:34:02Z)
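As a hedged illustration of the role-specific masks described above, the sketch below builds boolean masks for three hypothetical head roles (a local window, the previous token, and an unrestricted head) and applies them to attention logits; the cited paper derives its roles from prior analyses of attention heads.

```python
import torch

def role_masks(seq_len: int) -> torch.Tensor:
    """Boolean attention masks for three hypothetical head roles.

    Returns (num_roles, seq_len, seq_len); True means the position may be
    attended. The previous-token mask falls back to self-attention so no
    query row is left with zero allowed positions.
    """
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]                          # offset j - i
    local = rel.abs() <= 2                                     # role 0: +/-2 window
    prev = (rel == -1) | torch.eye(seq_len, dtype=torch.bool)  # role 1: previous token
    free = torch.ones(seq_len, seq_len, dtype=torch.bool)      # role 2: unrestricted
    return torch.stack([local, prev, free])

masks = role_masks(seq_len=6)
scores = torch.randn(3, 6, 6)                                  # per-head attention logits
attn = torch.softmax(scores.masked_fill(~masks, float("-inf")), dim=-1)
```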
- Alleviating the Inequality of Attention Heads for Neural Machine Translation [60.34732031315221]
Recent studies show that the attention heads in the Transformer are not equally important.
We propose a simple masking method, HeadMask, implemented in two specific ways.
Experiments show that translation improvements are achieved on multiple language pairs.
arXiv Detail & Related papers (2020-09-21T08:14:30Z)
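A minimal sketch of the head-masking idea summarized above, in a random variant: whole attention heads are zeroed out during training so that the remaining heads must carry the prediction. The masking probability and fallback rule are illustrative assumptions.

```python
import torch

def head_mask(head_outputs: torch.Tensor, p: float = 0.25) -> torch.Tensor:
    """Randomly zero whole attention heads (training only).

    head_outputs: (batch, num_heads, time, head_dim). p is an illustrative
    masking probability; the paper also describes a second selection scheme.
    """
    num_heads = head_outputs.shape[1]
    keep = torch.rand(num_heads, device=head_outputs.device) >= p
    if not keep.any():                                     # never mask every head
        keep[torch.randint(num_heads, (1,))] = True
    return head_outputs * keep[None, :, None, None].to(head_outputs.dtype)

out = head_mask(torch.randn(4, 8, 50, 32))
```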
- Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference [68.12511526813991]
We provide a novel understanding of multi-head attention from a Bayesian perspective.
We propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention.
Experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity.
arXiv Detail & Related papers (2020-09-20T06:32:23Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
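A simplified sketch of the shared-projection idea summarized above: all heads share one query and one key projection, and each head reweights the shared dimensions with a learned mixing vector. Dimensions and the mixing scheme below are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CollaborativeQK(nn.Module):
    """Shared query/key projections with per-head mixing vectors (sketch)."""

    def __init__(self, model_dim: int, shared_dim: int, num_heads: int):
        super().__init__()
        self.q_proj = nn.Linear(model_dim, shared_dim)     # shared by all heads
        self.k_proj = nn.Linear(model_dim, shared_dim)     # shared by all heads
        self.mix = nn.Parameter(torch.randn(num_heads, shared_dim))  # per-head weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, model_dim) -> (batch, heads, time, time) attention maps
        q, k = self.q_proj(x), self.k_proj(x)
        q_h = q[:, None] * self.mix[None, :, None, :]      # heads reweight shared dims
        scores = q_h @ k[:, None].transpose(-2, -1)
        return torch.softmax(scores / q.shape[-1] ** 0.5, dim=-1)

attn = CollaborativeQK(model_dim=256, shared_dim=128, num_heads=8)
weights = attn(torch.randn(4, 50, 256))                    # (4, 8, 50, 50)
```

Sharing the query/key projections cuts parameters while the mixing vectors preserve head-specific behavior, which is the trade-off the paper's experiments examine.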