Hard-Coded Gaussian Attention for Neural Machine Translation
- URL: http://arxiv.org/abs/2005.00742v1
- Date: Sat, 2 May 2020 08:16:13 GMT
- Title: Hard-Coded Gaussian Attention for Neural Machine Translation
- Authors: Weiqiu You, Simeng Sun, Mohit Iyyer
- Abstract summary: We develop a "hard-coded" attention variant without any learned parameters.
Replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs.
Hard-coding cross attention as well significantly lowers BLEU, but much of this drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer.
- Score: 39.55545092068489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has questioned the importance of the Transformer's multi-headed
attention for achieving high translation quality. We push further in this
direction by developing a "hard-coded" attention variant without any learned
parameters. Surprisingly, replacing all learned self-attention heads in the
encoder and decoder with fixed, input-agnostic Gaussian distributions minimally
impacts BLEU scores across four different language pairs. However, additionally
hard-coding cross attention (which connects the decoder to the encoder)
significantly lowers BLEU, suggesting that it is more important than
self-attention. Much of this BLEU drop can be recovered by adding just a single
learned cross attention head to an otherwise hard-coded Transformer. Taken as a
whole, our results offer insight into which components of the Transformer are
actually important, which we hope will guide future work into the development
of simpler and more efficient attention-based models.
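The abstract does not spell out how the hard-coded heads are parameterized, so the NumPy sketch below is only an assumed, minimal instantiation of the idea: each head places a Gaussian-shaped weight over key positions centered at a fixed offset from the query position, so the attention distribution depends only on token positions and never on the input itself. The offsets (-1, 0, +1), the standard deviation, and the absence of masking and projections are illustrative assumptions, not details taken from the paper.
```python
import numpy as np

def hard_coded_gaussian_attention(values, offset=0, std=1.0):
    """Input-agnostic attention head: query position i weights key position j
    by a Gaussian in (j - i - offset). No queries, keys, or learned parameters.

    values: (seq_len, d_model) token representations.
    offset: assumed fixed center of the Gaussian relative to the query position.
    std:    assumed fixed standard deviation (not learned).
    """
    seq_len = values.shape[0]
    positions = np.arange(seq_len)
    # Signed distance of each key position j from the center i + offset.
    dist = positions[None, :] - (positions[:, None] + offset)   # (seq_len, seq_len)
    weights = np.exp(-0.5 * (dist / std) ** 2)
    weights /= weights.sum(axis=-1, keepdims=True)              # normalize over keys
    return weights @ values                                     # (seq_len, d_model)

# Three illustrative "heads" looking one token back, at the current token,
# and one token ahead (offsets are assumptions, not the paper's choices).
x = np.random.randn(6, 8)
heads = [hard_coded_gaussian_attention(x, offset=o) for o in (-1, 0, 1)]
```
Because the weights depend only on positions, they can be precomputed once per sequence length, which is one reason such heads are attractive for the simpler, more efficient attention-based models the authors mention.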
Related papers
- Differential Transformer [99.5117269150629]
The Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, the seq2seq task is solved with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- Tighter Bounds on the Expressivity of Transformer Encoders [9.974865253097127]
We identify a variant of first-order logic with counting quantifiers that is simultaneously an upper bound for fixed-precision transformer encoders and a lower bound for transformer encoders.
This brings us much closer than before to an exact characterization of the languages that transformer encoders recognize.
arXiv Detail & Related papers (2023-01-25T18:05:55Z)
- Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z)
- PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution [0.42970700836450487]
We propose a new model architecture, namely PiSLTRc, with two distinctive characteristics.
We explicitly select relevant features using a novel content-aware neighborhood gathering method.
We aggregate these features with position-informed temporal convolution layers, thus generating robust neighborhood-enhanced sign representation.
Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks.
arXiv Detail & Related papers (2021-07-27T05:01:27Z)
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns; a minimal sketch of such patterns follows this list.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
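The fixed attentive patterns in the last entry above are only named here, not specified. As a rough sketch in the same spirit, and under assumed patterns, each non-learnable head below attends entirely to a single position at a fixed offset from the query token (previous, current, or next); the actual pattern set in that paper may differ, and the one remaining learned head per layer is not shown.
```python
import numpy as np

def fixed_pattern_head(values, offset):
    """One non-learnable attentive pattern: every position i attends entirely
    to position i + offset (clipped at the sequence boundaries). The offsets
    used below are illustrative assumptions, not the paper's exact patterns."""
    seq_len = values.shape[0]
    targets = np.clip(np.arange(seq_len) + offset, 0, seq_len - 1)
    weights = np.zeros((seq_len, seq_len))
    weights[np.arange(seq_len), targets] = 1.0   # one-hot, position-only attention
    return weights @ values

x = np.random.randn(6, 8)
# All but one head per encoder layer would use such fixed patterns;
# the single learned head is kept as in a standard Transformer (not shown).
fixed_heads = [fixed_pattern_head(x, o) for o in (-1, 0, 1)]
```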
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.