Multi-Head Self-Attention with Role-Guided Masks
- URL: http://arxiv.org/abs/2012.12366v1
- Date: Tue, 22 Dec 2020 21:34:02 GMT
- Title: Multi-Head Self-Attention with Role-Guided Masks
- Authors: Dongsheng Wang and Casper Hansen and Lucas Chaves Lima and Christian
Hansen and Maria Maistro and Jakob Grue Simonsen and Christina Lioma
- Abstract summary: We propose a method to guide the attention heads towards roles identified in prior work as important.
We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input.
Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
- Score: 20.955992710112216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The state of the art in learning meaningful semantic representations of words
is the Transformer model and its attention mechanisms. Simply put, the
attention mechanisms learn to attend to specific parts of the input, dispensing
with recurrence and convolutions. While some of the learned attention heads have
been found to play linguistically interpretable roles, they can be redundant or
prone to errors. We propose a method to guide the attention heads towards roles
identified in prior work as important. We do this by defining role-specific
masks to constrain the heads to attend to specific parts of the input, such
that different heads are designed to play different roles. Experiments on text
classification and machine translation using 7 different datasets show that our
method outperforms competitive attention-based, CNN, and RNN baselines.
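The core idea of the abstract — constraining each attention head with a role-specific mask so that different heads attend to different parts of the input — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the roles here ("self", "prev", "next") are example positional roles, and the mask is applied additively to the attention scores before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def role_masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with an additive role mask.

    mask[i, j] = 0 where position i may attend to j, -inf where it may not.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + mask
    return softmax(scores, axis=-1) @ V

def role_mask(n, role):
    """Build a role-specific mask (illustrative roles, not the paper's exact set)."""
    m = np.full((n, n), -np.inf)
    if role == "self":      # each position attends only to itself
        np.fill_diagonal(m, 0.0)
    elif role == "prev":    # each position attends to the previous token
        m[np.arange(1, n), np.arange(0, n - 1)] = 0.0
        m[0, 0] = 0.0       # first token falls back to itself
    elif role == "next":    # each position attends to the following token
        m[np.arange(0, n - 1), np.arange(1, n)] = 0.0
        m[n - 1, n - 1] = 0.0
    return m

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))

# A "prev"-role head: every output is exactly the value of the preceding token.
out = role_masked_attention(Q, K, V, role_mask(n, "prev"))
```

In a full multi-head layer, each head would receive its own mask, so heads are structurally forced into distinct roles instead of relying on training to discover (possibly redundant) ones.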
Related papers
- Picking the Underused Heads: A Network Pruning Perspective of Attention
Head Selection for Fusing Dialogue Coreference Information [50.41829484199252]
Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing.
We investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective.
arXiv Detail & Related papers (2023-12-15T05:27:24Z)
- Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism [4.343604069244352]
We propose Attention Lens, a tool that enables researchers to translate the outputs of attention heads into vocabulary tokens.
Preliminary findings from our trained lenses indicate that attention heads play highly specialized roles in language models.
arXiv Detail & Related papers (2023-10-25T01:03:35Z)
- Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition [36.53453860656191]
We investigate approaches to increasing attention head diversity.
We show that introducing diversity-promoting auxiliary loss functions during training is a more effective approach.
Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.
arXiv Detail & Related papers (2022-09-13T15:50:03Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Multiple parallel attention heads improve the performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss term that prevents the model from assigning the same score to all heads, helping to identify the more important heads and improve performance.
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
- Dodrio: Exploring Transformer Models with Interactive Visualization [10.603327364971559]
Dodrio is an open-source interactive visualization tool to help NLP researchers and practitioners analyze attention mechanisms in transformer-based models with linguistic knowledge.
To facilitate the visual comparison of attention weights and linguistic knowledge, Dodrio applies different graph visualization techniques to represent attention weights with longer input text.
arXiv Detail & Related papers (2021-03-26T17:39:37Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z) - Salience Estimation with Multi-Attention Learning for Abstractive Text
Summarization [86.45110800123216]
In the task of text summarization, salience estimation for words, phrases or sentences is a critical component.
We propose a Multi-Attention Learning framework which contains two new attention learning components for salience estimation.
arXiv Detail & Related papers (2020-04-07T02:38:56Z) - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.