Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation
- URL: http://arxiv.org/abs/2002.10260v3
- Date: Mon, 5 Oct 2020 16:10:31 GMT
- Title: Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation
- Authors: Alessandro Raganato, Yves Scherrer and Jörg Tiedemann
- Abstract summary: We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
- Score: 73.11214377092121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have brought a radical change to neural machine
translation. A key feature of the Transformer architecture is the so-called
multi-head attention mechanism, which allows the model to focus simultaneously
on different parts of the input. However, recent works have shown that most
attention heads learn simple, and often redundant, positional patterns. In this
paper, we propose to replace all but one attention head of each encoder layer
with simple fixed -- non-learnable -- attentive patterns that are solely based
on position and do not require any external knowledge. Our experiments with
different data sizes and multiple language pairs show that fixing the attention
heads on the encoder side of the Transformer at training time does not impact
the translation quality and even increases BLEU scores by up to 3 points in
low-resource scenarios.
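To make the proposal concrete, the following is a minimal NumPy sketch of what such fixed, position-based encoder heads could look like: each head is a one-hot distribution over a relative position (e.g. the previous, current, or next token), clipped at the sentence boundaries, so it contains no learnable parameters. The specific offsets and function names are illustrative assumptions, not the exact patterns used in the paper.

```python
import numpy as np

def fixed_attention_pattern(seq_len: int, offset: int) -> np.ndarray:
    """Non-learnable attention matrix: each query position attends entirely to
    the token at (position + offset), clipped to the sequence boundaries.
    offset=-1 -> previous token, 0 -> current token, 1 -> next token."""
    attn = np.zeros((seq_len, seq_len))
    for q in range(seq_len):
        k = min(max(q + offset, 0), seq_len - 1)  # clip at sentence edges
        attn[q, k] = 1.0
    return attn

def fixed_heads_output(values: np.ndarray, offsets=(-2, -1, 0, 1, 2)) -> np.ndarray:
    """Apply several fixed positional heads to a (seq_len, d_model) value matrix
    and concatenate their outputs, mimicking multi-head attention whose weights
    are hard-coded and input-agnostic."""
    heads = [fixed_attention_pattern(values.shape[0], off) @ values for off in offsets]
    return np.concatenate(heads, axis=-1)

# Usage: 6 tokens with 4-dimensional representations -> (6, 20), i.e. 5 fixed heads.
x = np.random.randn(6, 4)
print(fixed_heads_output(x).shape)
```

In the paper's setting, one head per encoder layer remains learnable and the remaining heads follow fixed patterns of this kind.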
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art results on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - Sparsity and Sentence Structure in Encoder-Decoder Attention of
Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z) - Differentiable Subset Pruning of Transformer Heads [71.7904179689271]
We introduce a new head pruning technique that we term differentiable subset pruning.
We show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
arXiv Detail & Related papers (2021-08-10T13:08:34Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z) - Multi-branch Attentive Transformer [152.07840447196384]
We propose a simple yet effective variant of Transformer called multi-branch attentive Transformer.
The attention layer is the average of multiple branches and each branch is an independent multi-head attention layer.
Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements.
arXiv Detail & Related papers (2020-06-18T04:24:28Z) - Hard-Coded Gaussian Attention for Neural Machine Translation [39.55545092068489]
We develop a "hard-coded" attention variant without any learned parameters.
Replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs.
Much of the BLEU drop incurred when cross attention is also hard-coded can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer (an illustrative sketch of such a Gaussian pattern appears after this list).
arXiv Detail & Related papers (2020-05-02T08:16:13Z) - Hierarchical Transformer Network for Utterance-level Emotion Recognition [0.0]
We address some challenges in utterance-level emotion recognition (ULER).
Unlike the traditional text classification problem, this task is supported by a limited number of datasets.
We use a pretrained language model, bidirectional encoder representations from transformers (BERT), as the lower-level transformer.
In addition, we add speaker embeddings to the model for the first time, which enables our model to capture the interaction between speakers.
arXiv Detail & Related papers (2020-02-18T13:44:49Z)
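The hard-coded Gaussian attention mentioned above is closely related to the fixed patterns of the main paper. As a rough illustration, here is a minimal NumPy sketch of an input-agnostic Gaussian attention head whose weights depend only on position; the function name, offset, and standard deviation are illustrative assumptions rather than the exact configuration of the cited paper.

```python
import numpy as np

def hard_coded_gaussian_attention(seq_len: int, offset: int = 0, std: float = 1.0) -> np.ndarray:
    """Input-agnostic attention weights: each query position q attends to key
    positions with a Gaussian centered at q + offset (offset=-1 focuses on the
    previous token). Rows are normalized to sum to 1."""
    positions = np.arange(seq_len)
    centers = positions + offset                        # per-row Gaussian centers
    dist2 = (positions[None, :] - centers[:, None]) ** 2
    weights = np.exp(-dist2 / (2.0 * std ** 2))
    return weights / weights.sum(axis=-1, keepdims=True)

# Usage: a 5-token sentence where each position mostly attends to its left neighbor.
print(np.round(hard_coded_gaussian_attention(5, offset=-1, std=1.0), 2))
```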