A Dynamic Head Importance Computation Mechanism for Neural Machine
Translation
- URL: http://arxiv.org/abs/2108.01377v1
- Date: Tue, 3 Aug 2021 09:16:55 GMT
- Title: A Dynamic Head Importance Computation Mechanism for Neural Machine
Translation
- Authors: Akshay Goindani and Manish Shrivastava
- Abstract summary: Multiple parallel attention mechanisms that use multiple attention heads improve the performance of the Transformer model for various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss function that prevents the model from assigning the same score to all heads, which helps identify the more important heads and improve performance.
- Score: 22.784419165117512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple parallel attention mechanisms that use multiple attention heads
improve the performance of the Transformer model for various applications,
e.g., Neural Machine Translation (NMT) and text classification. In the
multi-head attention mechanism, different heads attend to different parts of
the input. However, multiple heads might attend to the same part of the
input, making some heads redundant and leaving the model's resources
under-utilized. One approach to avoid this is to prune the least important
heads based on an importance score. In this work, we focus on designing a
Dynamic Head Importance Computation Mechanism (DHICM) to dynamically
calculate the importance of a head with respect to the input. Our insight is
to design an additional attention layer together with multi-head attention,
and to utilize the outputs of the multi-head attention along with the input
to compute the importance of each head. Additionally, we add an extra loss
function that prevents the model from assigning the same score to all heads,
which helps identify the more important heads and improves performance. We
analyzed the performance of DHICM for NMT on different languages. Experiments
on different datasets show that DHICM outperforms the traditional
Transformer-based approach by a large margin, especially when less training
data is available.
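The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical PyTorch sketch of that description, not the paper's actual implementation: the class name, the way the input and head outputs are summarized before scoring, and the choice of an entropy penalty as the extra loss are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DHICMSketch(nn.Module):
    """Hypothetical sketch: an extra attention-style scorer weighs each head
    using the layer input together with that head's output, and an auxiliary
    penalty discourages uniform head scores."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.scorer = nn.Linear(d_model + self.d_k, 1)   # scores (input, head) pairs
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.reshape(b, t, self.h, self.d_k).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        heads = attn @ v                                 # per-head outputs (b, h, seq, d_k)

        # Dynamic head importance: one score per head, conditioned on the input.
        x_summary = x.mean(dim=1, keepdim=True).expand(b, self.h, -1)   # (b, h, d_model)
        head_summary = heads.mean(dim=2)                                # (b, h, d_k)
        scores = self.scorer(torch.cat([x_summary, head_summary], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)            # (b, h) head importances

        # Scale each head's output by its importance before the output projection.
        weighted = heads * alpha[:, :, None, None] * self.h
        y = self.out(weighted.transpose(1, 2).reshape(b, t, -1))

        # Extra loss (one plausible choice): minimise the entropy of the
        # importance distribution so the scores cannot stay near-uniform.
        aux_loss = -(alpha * (alpha + 1e-9).log()).sum(dim=-1).mean()
        return y, aux_loss
```

During training, the returned auxiliary term would be added to the translation loss with some weight, e.g. `loss = nmt_loss + lam * aux_loss`.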
Related papers
- Picking the Underused Heads: A Network Pruning Perspective of Attention
Head Selection for Fusing Dialogue Coreference Information [50.41829484199252]
Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing.
We investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective.
arXiv Detail & Related papers (2023-12-15T05:27:24Z)
- Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention [42.92397219764559]
We propose an alternative module that uses only a single shared projection matrix and multiple head embeddings (MHE)
We empirically demonstrate that our MHE attention is substantially more memory efficient compared to alternative attention mechanisms.
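The summary only names the idea (a single shared projection plus per-head embeddings), so the sketch below is speculative: the additive head embeddings, the per-role embedding table `head_emb`, and averaging rather than concatenating heads are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class MHESketch(nn.Module):
    """Speculative sketch: one shared projection per role (Q/K/V) plus a
    small learned embedding per head, instead of separate per-head matrices."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.h, self.d = n_heads, d_model
        self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))
        # one embedding vector per head and per role: the only per-head parameters
        self.head_emb = nn.Parameter(torch.randn(3, n_heads, d_model) * 0.02)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)           # shared projections
        # add a head-specific offset instead of using per-head projection matrices
        qh = q[:, None] + self.head_emb[0][None, :, None]   # (batch, heads, seq, d_model)
        kh = k[:, None] + self.head_emb[1][None, :, None]
        vh = v[:, None] + self.head_emb[2][None, :, None]
        attn = torch.softmax(qh @ kh.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        return self.out((attn @ vh).mean(dim=1))            # average heads back to (batch, seq, d_model)
```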
arXiv Detail & Related papers (2023-10-11T21:38:40Z)
- Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition [36.53453860656191]
We investigate approaches to increasing attention head diversity.
We show that introducing diversity-promoting auxiliary loss functions during training is a more effective approach.
Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.
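As one concrete illustration of a diversity-promoting auxiliary loss (not necessarily the formulation used in this paper), the following sketch penalizes the average pairwise cosine similarity between per-head outputs; minimizing it pushes the heads apart.

```python
import torch
import torch.nn.functional as F

def head_diversity_penalty(head_outputs: torch.Tensor) -> torch.Tensor:
    """Illustrative auxiliary loss: average pairwise cosine similarity
    between heads. head_outputs: (batch, n_heads, seq, d_head)."""
    b, h, _, _ = head_outputs.shape
    flat = F.normalize(head_outputs.reshape(b, h, -1), dim=-1)
    sim = flat @ flat.transpose(1, 2)                                  # (b, h, h) cosine similarities
    off_diag = sim - torch.eye(h, device=sim.device, dtype=sim.dtype)  # zero out self-similarity
    return off_diag.sum(dim=(1, 2)).mean() / (h * (h - 1))
```

The penalty would be added to the main training loss with a small weight.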
arXiv Detail & Related papers (2022-09-13T15:50:03Z)
- Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
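A toy sketch of what mixing attention patterns across heads could look like; the concrete patterns chosen here (full softmax vs. a local window) are illustrative assumptions, not necessarily the mechanisms Multiformer uses.

```python
import torch

def head_configurable_attention(q, k, v, head_kinds, window: int = 4):
    """q, k, v: (batch, n_heads, seq, d_head); head_kinds: list of "full"/"local"."""
    t, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5                 # (batch, heads, seq, seq)
    idx = torch.arange(t, device=q.device)
    blocked = (idx[None, :] - idx[:, None]).abs() > window      # True = outside the local window
    for i, kind in enumerate(head_kinds):
        if kind == "local":                                     # restrict this head to a window
            scores[:, i] = scores[:, i].masked_fill(blocked, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```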
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
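The "precise control of the sparsity level" point can be illustrated with a simple straight-through gate that keeps exactly k heads; the paper's actual relaxation is more sophisticated, and this sketch only conveys the idea.

```python
import torch

def topk_head_gate(logits: torch.Tensor, k: int) -> torch.Tensor:
    """logits: (n_heads,) learnable scores -> (n_heads,) 0/1 mask whose
    gradient flows through softmax(logits)."""
    soft = torch.softmax(logits, dim=-1)
    hard = torch.zeros_like(soft)
    hard[soft.topk(k).indices] = 1.0       # keep exactly k heads
    return hard + soft - soft.detach()     # straight-through estimator
```

Each head's output would be multiplied by its gate entry, so training the logits decides which k heads survive.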
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- Multi-Head Self-Attention with Role-Guided Masks [20.955992710112216]
We propose a method to guide the attention heads towards roles identified in prior work as important.
We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input.
Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
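A small sketch of what role-specific masks could look like; the concrete roles below (attend to self, previous, or next token) are examples and may not match the roles used in the paper.

```python
import torch

def role_masks(seq_len: int) -> dict:
    """Illustrative role-specific masks (True = key may be attended to).
    Self is always allowed so edge positions never mask out every key."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]        # key position minus query position
    return {
        "self": rel == 0,
        "prev_or_self": (rel == -1) | (rel == 0),
        "next_or_self": (rel == 1) | (rel == 0),
    }
```

A head assigned a role would apply its mask to the attention logits before the softmax, e.g. `scores.masked_fill(~mask, float("-inf"))`.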
arXiv Detail & Related papers (2020-12-22T21:34:02Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
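A sketch of the hard-retrieval idea as described in the summary: each query retrieves the single value whose key scores highest, instead of a softmax-weighted sum over all values. How the argmax is handled during training is omitted here.

```python
import torch

def hard_retrieval_attention(q, k, v):
    """q: (batch, seq_q, d); k, v: (batch, seq_k, d) -> (batch, seq_q, d)."""
    scores = q @ k.transpose(-2, -1)       # (batch, seq_q, seq_k)
    best = scores.argmax(dim=-1)           # index of the best key per query
    return torch.gather(v, 1, best.unsqueeze(-1).expand(-1, -1, v.size(-1)))
```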
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
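A rough sketch of the collaborative idea: all heads share one key/query projection and each head re-weights the shared dimensions with a learned mixing vector. The dimensions, the value path, and the head aggregation are simplified assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class CollaborativeAttentionSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_shared: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_shared)               # shared across heads
        self.k = nn.Linear(d_model, d_shared)                # shared across heads
        self.v = nn.Linear(d_model, d_model)
        self.mix = nn.Parameter(torch.ones(n_heads, d_shared))  # per-head mixing vectors
        self.d_shared = d_shared

    def forward(self, x):                                    # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # per-head logits: (q * m_i) k^T with shared q and k
        qh = q[:, None] * self.mix[None, :, None, :]         # (batch, heads, seq, d_shared)
        scores = qh @ k[:, None].transpose(-2, -1) / self.d_shared ** 0.5
        attn = torch.softmax(scores, dim=-1)                 # (batch, heads, seq, seq)
        return (attn @ v[:, None]).mean(dim=1)               # (batch, seq, d_model)
```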
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
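As an illustration of fixed, non-learnable attentive patterns, the sketch below builds deterministic attention matrices where a head attends to the current, previous, or next token (edge positions fall back to the current token); the exact pattern set used in the paper may differ.

```python
import torch

def fixed_attention_patterns(seq_len: int) -> torch.Tensor:
    """Returns (3, seq_len, seq_len) row-stochastic attention matrices:
    attend-to-current, attend-to-previous, attend-to-next."""
    idx = torch.arange(seq_len)
    current = torch.eye(seq_len)
    prev = torch.zeros(seq_len, seq_len)
    prev[idx[1:], idx[:-1]] = 1.0
    prev[0, 0] = 1.0                       # first token attends to itself
    nxt = torch.zeros(seq_len, seq_len)
    nxt[idx[:-1], idx[1:]] = 1.0
    nxt[-1, -1] = 1.0                      # last token attends to itself
    return torch.stack([current, prev, nxt])
```

Such matrices can be used in place of the learned softmax(QK^T) for the fixed heads, while one head per layer remains learnable.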
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
- Low-Rank Bottleneck in Multi-head Attention Models [74.83235382203604]
We argue that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads.
We propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power.
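A toy illustration of the proposed change as summarized above, with everything else about the layer omitted: the per-head query/key width is set to the sequence length instead of d_model // n_heads, so the per-head score matrix is no longer rank-limited by a shrinking head dimension.

```python
import torch.nn as nn

def head_query_key(d_model: int, n_heads: int, seq_len: int):
    standard = nn.Linear(d_model, d_model // n_heads)  # usual head width, shrinks as heads grow
    proposed = nn.Linear(d_model, seq_len)             # fixed head width = sequence length
    return standard, proposed
```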
arXiv Detail & Related papers (2020-02-17T16:16:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.