Multi-Head Self-Attention with Role-Guided Masks
- URL: http://arxiv.org/abs/2012.12366v1
- Date: Tue, 22 Dec 2020 21:34:02 GMT
- Title: Multi-Head Self-Attention with Role-Guided Masks
- Authors: Dongsheng Wang and Casper Hansen and Lucas Chaves Lima and Christian
Hansen and Maria Maistro and Jakob Grue Simonsen and Christina Lioma
- Abstract summary: We propose a method to guide the attention heads towards roles identified in prior work as important.
We do this by defining role-specific masks to constrain the heads to attend to specific parts of the input.
Experiments on text classification and machine translation using 7 different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
- Score: 20.955992710112216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The state of the art in learning meaningful semantic representations of words
is the Transformer model and its attention mechanisms. Simply put, the
attention mechanisms learn to attend to specific parts of the input, dispensing
with recurrence and convolutions. While some of the learned attention heads have
been found to play linguistically interpretable roles, they can be redundant or
prone to errors. We propose a method to guide the attention heads towards roles
identified in prior work as important. We do this by defining role-specific
masks to constrain the heads to attend to specific parts of the input, such
that different heads are designed to play different roles. Experiments on text
classification and machine translation using 7 different datasets show that our
method outperforms competitive attention-based, CNN, and RNN baselines.
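The core idea of the abstract — constraining each attention head with a role-specific mask so that different heads attend to different parts of the input — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the roles here ("self", "prev", "next") are example positional roles, and the mask is applied additively to the attention scores before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def role_masked_attention(Q, K, V, mask):
    """Scaled dot-product attention with an additive role mask.

    mask[i, j] = 0 where position i may attend to j, -inf where it may not.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + mask
    return softmax(scores, axis=-1) @ V

def role_mask(n, role):
    """Build a role-specific mask (illustrative roles, not the paper's exact set)."""
    m = np.full((n, n), -np.inf)
    if role == "self":      # each position attends only to itself
        np.fill_diagonal(m, 0.0)
    elif role == "prev":    # each position attends to the previous token
        m[np.arange(1, n), np.arange(0, n - 1)] = 0.0
        m[0, 0] = 0.0       # first token falls back to itself
    elif role == "next":    # each position attends to the following token
        m[np.arange(0, n - 1), np.arange(1, n)] = 0.0
        m[n - 1, n - 1] = 0.0
    return m

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))

# A "prev"-role head: every output is exactly the value of the preceding token.
out = role_masked_attention(Q, K, V, role_mask(n, "prev"))
```

In a full multi-head layer, each head would receive its own mask, so heads are structurally forced into distinct roles instead of relying on training to discover (possibly redundant) ones.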
Related papers
- Picking the Underused Heads: A Network Pruning Perspective of Attention
Head Selection for Fusing Dialogue Coreference Information [50.41829484199252]
Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing.
We investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective.
arXiv Detail & Related papers (2023-12-15T05:27:24Z)
- Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism [4.343604069244352]
We propose Attention Lens, a tool that enables researchers to translate the outputs of attention heads into vocabulary tokens.
Preliminary findings from our trained lenses indicate that attention heads play highly specialized roles in language models.
arXiv Detail & Related papers (2023-10-25T01:03:35Z)
- Analysis of Self-Attention Head Diversity for Conformer-based Automatic Speech Recognition [36.53453860656191]
We investigate approaches to increasing attention head diversity.
We show that introducing diversity-promoting auxiliary loss functions during training is a more effective approach.
Finally, we draw a connection between the diversity of attention heads and the similarity of the gradients of head parameters.
arXiv Detail & Related papers (2022-09-13T15:50:03Z)
- Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation [22.784419165117512]
Multiple parallel attention heads improve the performance of the Transformer model across various applications.
In this work, we focus on designing a Dynamic Head Importance Computation Mechanism (DHICM) to dynamically calculate the importance of a head with respect to the input.
We add an extra loss term that prevents the model from assigning the same score to all heads, helping to identify the more important heads and improve performance.
arXiv Detail & Related papers (2021-08-03T09:16:55Z)
- Dodrio: Exploring Transformer Models with Interactive Visualization [10.603327364971559]
Dodrio is an open-source interactive visualization tool to help NLP researchers and practitioners analyze attention mechanisms in transformer-based models with linguistic knowledge.
To facilitate the visual comparison of attention weights and linguistic knowledge, Dodrio applies different graph visualization techniques to represent attention weights with longer input text.
arXiv Detail & Related papers (2021-03-26T17:39:37Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z) - Salience Estimation with Multi-Attention Learning for Abstractive Text
Summarization [86.45110800123216]
In the task of text summarization, salience estimation for words, phrases or sentences is a critical component.
We propose a Multi-Attention Learning framework which contains two new attention learning components for salience estimation.
arXiv Detail & Related papers (2020-04-07T02:38:56Z) - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.