Hard-Coded Gaussian Attention for Neural Machine Translation
- URL: http://arxiv.org/abs/2005.00742v1
- Date: Sat, 2 May 2020 08:16:13 GMT
- Title: Hard-Coded Gaussian Attention for Neural Machine Translation
- Authors: Weiqiu You, Simeng Sun, Mohit Iyyer
- Abstract summary: We develop a "hard-coded" attention variant without any learned parameters.
Replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs.
Hard-coding cross attention as well significantly lowers BLEU, but much of this drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer.
- Score: 39.55545092068489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has questioned the importance of the Transformer's multi-headed
attention for achieving high translation quality. We push further in this
direction by developing a "hard-coded" attention variant without any learned
parameters. Surprisingly, replacing all learned self-attention heads in the
encoder and decoder with fixed, input-agnostic Gaussian distributions minimally
impacts BLEU scores across four different language pairs. However, additionally
hard-coding cross attention (which connects the decoder to the encoder)
significantly lowers BLEU, suggesting that it is more important than
self-attention. Much of this BLEU drop can be recovered by adding just a single
learned cross attention head to an otherwise hard-coded Transformer. Taken as a
whole, our results offer insight into which components of the Transformer are
actually important, which we hope will guide future work into the development
of simpler and more efficient attention-based models.
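The abstract does not spell out how the hard-coded heads are parameterized, so the NumPy sketch below is only an assumed, minimal instantiation of the idea: each head places a Gaussian-shaped weight over key positions centered at a fixed offset from the query position, so the attention distribution depends only on token positions and never on the input itself. The offsets (-1, 0, +1), the standard deviation, and the absence of masking and projections are illustrative assumptions, not details taken from the paper.
```python
import numpy as np

def hard_coded_gaussian_attention(values, offset=0, std=1.0):
    """Input-agnostic attention head: query position i weights key position j
    by a Gaussian in (j - i - offset). No queries, keys, or learned parameters.

    values: (seq_len, d_model) token representations.
    offset: assumed fixed center of the Gaussian relative to the query position.
    std:    assumed fixed standard deviation (not learned).
    """
    seq_len = values.shape[0]
    positions = np.arange(seq_len)
    # Signed distance of each key position j from the center i + offset.
    dist = positions[None, :] - (positions[:, None] + offset)   # (seq_len, seq_len)
    weights = np.exp(-0.5 * (dist / std) ** 2)
    weights /= weights.sum(axis=-1, keepdims=True)              # normalize over keys
    return weights @ values                                     # (seq_len, d_model)

# Three illustrative "heads" looking one token back, at the current token,
# and one token ahead (offsets are assumptions, not the paper's choices).
x = np.random.randn(6, 8)
heads = [hard_coded_gaussian_attention(x, offset=o) for o in (-1, 0, 1)]
```
Because the weights depend only on positions, they can be precomputed once per sequence length, which is one reason such heads are attractive for the simpler, more efficient attention-based models the authors mention.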
Related papers
- Differential Transformer [99.5117269150629]
The Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, the seq2seq task is solved with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- Tighter Bounds on the Expressivity of Transformer Encoders [9.974865253097127]
We identify a variant of first-order logic with counting quantifiers that is simultaneously an upper bound for fixed-precision transformer encoders and a lower bound for transformer encoders.
This brings us much closer than before to an exact characterization of the languages that transformer encoders recognize.
arXiv Detail & Related papers (2023-01-25T18:05:55Z)
- Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z)
- PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution [0.42970700836450487]
We propose a new model architecture, namely PiSLTRc, with two distinctive characteristics.
We explicitly select relevant features using a novel content-aware neighborhood gathering method.
We aggregate these features with position-informed temporal convolution layers, thus generating robust neighborhood-enhanced sign representation.
Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks.
arXiv Detail & Related papers (2021-07-27T05:01:27Z)
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns; a minimal sketch of such patterns follows this list.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
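The fixed attentive patterns in the last entry above are only named here, not specified. As a rough sketch in the same spirit, and under assumed patterns, each non-learnable head below attends entirely to a single position at a fixed offset from the query token (previous, current, or next); the actual pattern set in that paper may differ, and the one remaining learned head per layer is not shown.
```python
import numpy as np

def fixed_pattern_head(values, offset):
    """One non-learnable attentive pattern: every position i attends entirely
    to position i + offset (clipped at the sequence boundaries). The offsets
    used below are illustrative assumptions, not the paper's exact patterns."""
    seq_len = values.shape[0]
    targets = np.clip(np.arange(seq_len) + offset, 0, seq_len - 1)
    weights = np.zeros((seq_len, seq_len))
    weights[np.arange(seq_len), targets] = 1.0   # one-hot, position-only attention
    return weights @ values

x = np.random.randn(6, 8)
# All but one head per encoder layer would use such fixed patterns;
# the single learned head is kept as in a standard Transformer (not shown).
fixed_heads = [fixed_pattern_head(x, o) for o in (-1, 0, 1)]
```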
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.