Learning Hard Retrieval Decoder Attention for Transformers
- URL: http://arxiv.org/abs/2009.14658v2
- Date: Fri, 10 Sep 2021 00:17:54 GMT
- Title: Learning Hard Retrieval Decoder Attention for Transformers
- Authors: Hongfei Xu and Qiuhui Liu and Josef van Genabith and Deyi Xiong
- Abstract summary: The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
- Score: 69.40942736249397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer translation model is based on the multi-head attention
mechanism, which can be parallelized easily. The multi-head attention network
performs the scaled dot-product attention function in parallel, empowering the
model by jointly attending to information from different representation
subspaces at different positions. In this paper, we present an approach to
learning hard retrieval attention, in which an attention head attends to only one
token in the sentence rather than to all tokens. The matrix multiplication between
attention probabilities and the value sequence in the standard scaled
dot-product attention can thus be replaced by a simple and efficient retrieval
operation. We show that our hard retrieval attention mechanism is 1.43 times
faster in decoding, while preserving translation quality on a wide range of
machine translation tasks when used in the decoder self- and cross-attention
networks.
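The core idea of the abstract can be summarised in a few lines: instead of multiplying the full attention-probability matrix with the value sequence, each query simply retrieves the single value vector at its highest-scoring key position. The sketch below is a minimal NumPy illustration of that decoding-time substitution, not the authors' implementation; the single-head formulation, array shapes, and the function names are assumptions made for clarity.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard attention: softmax(QK^T / sqrt(d)) @ V."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)        # (batch, q_len, k_len)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v                                       # (batch, q_len, d)

def hard_retrieval_attention(q, k, v):
    """Hard retrieval (sketch): each query gathers the single value vector at
    its highest-scoring key position, replacing the probs @ V matmul."""
    scores = q @ k.transpose(0, 2, 1)                      # scaling does not change the argmax
    idx = scores.argmax(-1)                                # (batch, q_len)
    return np.take_along_axis(v, idx[..., None], axis=1)   # (batch, q_len, d)

# Toy check: the hard version produces outputs of the same shape as the soft one.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 5, 8)) for _ in range(3))
assert hard_retrieval_attention(q, k, v).shape == scaled_dot_product_attention(q, k, v).shape
```

The speed-up reported in the abstract comes from the last line of `hard_retrieval_attention`: an index-and-gather over the value sequence replaces a dense matrix multiplication in the decoder self- and cross-attention networks.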
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones (a minimal sketch of this substitution appears after this list).
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition [32.45255303465946]
We introduce sparse attention and monotonic attention into Transformer-based ASR.
The experiments show that our method can effectively improve the attention mechanism on widely used benchmarks of speech recognition.
arXiv Detail & Related papers (2022-09-30T01:55:57Z)
- Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain prediction by any Transformer-based architecture.
Our method is superior to all existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
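Several entries above probe how far attention can be simplified. As a rough illustration of the substitution described in the PAPA entry, the sketch below swaps the input-dependent attention probabilities for a fixed mixing matrix; the uniform-averaging matrix, shapes, and function names are illustrative assumptions, not the paper's actual constant matrices.

```python
import numpy as np

def input_dependent_attention(q, k, v):
    """Standard attention: probabilities are computed from the input Q and K."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v

def constant_attention(v, const_probs):
    """PAPA-style probe (illustrative): a fixed matrix that ignores the input
    replaces the learned, input-dependent attention probabilities."""
    return const_probs @ v                       # (len, len) @ (batch, len, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 6, 8)) for _ in range(3))
uniform = np.full((6, 6), 1.0 / 6)               # illustrative constant choice
assert constant_attention(v, uniform).shape == input_dependent_attention(q, k, v).shape
```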
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.