Alleviating the Inequality of Attention Heads for Neural Machine
Translation
- URL: http://arxiv.org/abs/2009.09672v2
- Date: Wed, 31 Aug 2022 11:50:22 GMT
- Title: Alleviating the Inequality of Attention Heads for Neural Machine
Translation
- Authors: Zewei Sun, Shujian Huang, Xin-Yu Dai, Jiajun Chen
- Abstract summary: Recent studies show that the attention heads in the Transformer are not equal.
We propose a simple masking method, HeadMask, applied in two specific ways.
Experiments show that translation improvements are achieved on multiple language pairs.
- Score: 60.34732031315221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that the attention heads in the Transformer are not
equal. We relate this phenomenon to the imbalanced training of multi-head attention
and the model's dependence on specific heads. To tackle this problem, we propose a
simple masking method, HeadMask, applied in two specific ways. Experiments show that
translation improvements are achieved on multiple language pairs. Subsequent
empirical analyses also support our assumption and confirm the effectiveness of
the method.
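The abstract names the method but does not spell out the two masking variants. The following is a minimal sketch, assuming a PyTorch-style multi-head attention implementation, of one natural instantiation: randomly zeroing the outputs of a few heads at each training step so that the model cannot rely on any specific head. The function name and the n_masked argument are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def head_masked_attention(q, k, v, n_masked=2, training=True):
    """Scaled dot-product attention whose per-head outputs are randomly masked.

    q, k, v: tensors of shape (batch, n_heads, seq_len, d_head).
    n_masked: number of heads whose output is zeroed for this forward pass.
    """
    n_heads, d_head = q.size(1), q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (batch, n_heads, L, L)
    attn = F.softmax(scores, dim=-1)
    out = attn @ v                                      # (batch, n_heads, L, d_head)

    if training and n_masked > 0:
        # Draw a fresh mask every step: over training, every head is sometimes
        # forced to cope without the heads that would otherwise dominate.
        mask = torch.ones(n_heads, device=out.device)
        silenced = torch.randperm(n_heads, device=out.device)[:n_masked]
        mask[silenced] = 0.0
        out = out * mask.view(1, n_heads, 1, 1)
    return out
```

For example, with 8 heads and n_masked=2, each training step silences a random pair of heads; at inference time (training=False) all heads contribute as usual.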
Related papers
- Evade the Trap of Mediocrity: Promoting Diversity and Novelty in Text Generation via Concentrating Attention [85.5379146125199]
Powerful Transformer architectures have proven superior in generating high-quality sentences.
In this work, we find that sparser attention values in Transformer could improve diversity.
We introduce a novel attention regularization loss to control the sharpness of the attention distribution (a generic sketch of such a regularizer appears after this list).
arXiv Detail & Related papers (2022-11-14T07:53:16Z)
- Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition [36.53453860656191]
We investigate approaches to increasing attention head diversity.
We show that introducing diversity-promoting auxiliary loss functions during training is a more effective approach.
Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.
arXiv Detail & Related papers (2022-09-13T15:50:03Z)
- Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns across the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads? [16.392272086563175]
This paper aims to analyze individual components of a multilingual neural machine translation (NMT) model.
We look at the encoder self-attention and encoder-decoder attention heads that are more specific to the translation of a certain language pair than others.
Experimental results show that, surprisingly, the set of most important attention heads is very similar across the language pairs.
arXiv Detail & Related papers (2021-05-31T13:15:55Z)
- Multi-Head Self-Attention with Role-Guided Masks [20.955992710112216]
We propose a method to guide the attention heads towards roles identified in prior work as important.
We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input.
Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
arXiv Detail & Related papers (2020-12-22T21:34:02Z)
- Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs [57.74359320513427]
Methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI.
We study the differences between these two categories, and show how they can be unified under a single theoretical framework.
We conduct controlled experiments to discern the empirical differences between five V&L BERTs.
arXiv Detail & Related papers (2020-11-30T18:55:24Z)
- Uncertainty-Aware Semantic Augmentation for Neural Machine Translation [37.555675157198145]
We propose uncertainty-aware semantic augmentation, which explicitly captures the universal semantic information among multiple semantically-equivalent source sentences.
Our approach significantly outperforms the strong baselines and the existing methods.
arXiv Detail & Related papers (2020-10-09T07:48:09Z)
- A Mixture of $h-1$ Heads is Better than $h$ Heads [63.12336930345417]
We propose the mixture of attentive experts model (MAE).
Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks.
Our analysis shows that our model learns to specialize different experts to different inputs.
arXiv Detail & Related papers (2020-05-13T19:05:58Z)
- Translation Artifacts in Cross-lingual Transfer Learning [51.66536640084888]
We show that machine translation can introduce subtle artifacts that have a notable impact in existing cross-lingual models.
In natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them.
We also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.
arXiv Detail & Related papers (2020-04-09T17:54:30Z)
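Two of the entries above (the diversity-via-concentrating-attention paper and the Conformer head-diversity analysis) add auxiliary losses that shape the attention distributions during training. Neither abstract gives the exact formulation; the sketch below shows one generic way such a term is often implemented, an entropy penalty on the attention weights, with lambda_ent as a hypothetical loss weight.

```python
import torch

def attention_entropy(attn, eps=1e-9):
    """Mean entropy of the attention rows.

    attn: (batch, n_heads, query_len, key_len); each row sums to 1.
    Low entropy means sharp (concentrated) attention; high entropy means flat attention.
    """
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (batch, n_heads, query_len)
    return ent.mean()

# Hypothetical usage: penalizing high entropy pushes the model toward sharper
# attention; flipping the sign of lambda_ent instead encourages flatter attention.
# loss = task_loss + lambda_ent * attention_entropy(attn_weights)
```

Diversity-promoting variants typically compare heads to one another instead (for example, penalizing pairwise similarity of per-head attention maps); the exact losses used in the papers above may differ from this sketch.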