Learning Hard Retrieval Decoder Attention for Transformers
- URL: http://arxiv.org/abs/2009.14658v2
- Date: Fri, 10 Sep 2021 00:17:54 GMT
- Title: Learning Hard Retrieval Decoder Attention for Transformers
- Authors: Hongfei Xu and Qiuhui Liu and Josef van Genabith and Deyi Xiong
- Abstract summary: The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
- Score: 69.40942736249397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer translation model is based on the multi-head attention
mechanism, which can be parallelized easily. The multi-head attention network
performs the scaled dot-product attention function in parallel, empowering the
model by jointly attending to information from different representation
subspaces at different positions. In this paper, we present an approach to
learning hard retrieval attention, in which an attention head attends to only one
token in the sentence rather than to all tokens. The matrix multiplication between
attention probabilities and the value sequence in the standard scaled
dot-product attention can thus be replaced by a simple and efficient retrieval
operation. We show that our hard retrieval attention mechanism is 1.43 times
faster in decoding, while preserving translation quality on a wide range of
machine translation tasks when used in the decoder self- and cross-attention
networks.
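The core idea of the abstract can be summarised in a few lines: instead of multiplying the full attention-probability matrix with the value sequence, each query simply retrieves the single value vector at its highest-scoring key position. The sketch below is a minimal NumPy illustration of that decoding-time substitution, not the authors' implementation; the single-head formulation, array shapes, and the function names are assumptions made for clarity.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard attention: softmax(QK^T / sqrt(d)) @ V."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)        # (batch, q_len, k_len)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v                                       # (batch, q_len, d)

def hard_retrieval_attention(q, k, v):
    """Hard retrieval (sketch): each query gathers the single value vector at
    its highest-scoring key position, replacing the probs @ V matmul."""
    scores = q @ k.transpose(0, 2, 1)                      # scaling does not change the argmax
    idx = scores.argmax(-1)                                # (batch, q_len)
    return np.take_along_axis(v, idx[..., None], axis=1)   # (batch, q_len, d)

# Toy check: the hard version produces outputs of the same shape as the soft one.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 5, 8)) for _ in range(3))
assert hard_retrieval_attention(q, k, v).shape == scaled_dot_product_attention(q, k, v).shape
```

The speed-up reported in the abstract comes from the last line of `hard_retrieval_attention`: an index-and-gather over the value sequence replaces a dense matrix multiplication in the decoder self- and cross-attention networks.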
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones (a minimal sketch of this substitution appears after this list).
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition [32.45255303465946]
We introduce sparse attention and monotonic attention into Transformer-based ASR.
The experiments show that our method can effectively improve the attention mechanism on widely used benchmarks of speech recognition.
arXiv Detail & Related papers (2022-09-30T01:55:57Z)
- Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain prediction by any Transformer-based architecture.
Our method is superior to all existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
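Several entries above probe how far attention can be simplified. As a rough illustration of the substitution described in the PAPA entry, the sketch below swaps the input-dependent attention probabilities for a fixed mixing matrix; the uniform-averaging matrix, shapes, and function names are illustrative assumptions, not the paper's actual constant matrices.

```python
import numpy as np

def input_dependent_attention(q, k, v):
    """Standard attention: probabilities are computed from the input Q and K."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v

def constant_attention(v, const_probs):
    """PAPA-style probe (illustrative): a fixed matrix that ignores the input
    replaces the learned, input-dependent attention probabilities."""
    return const_probs @ v                       # (len, len) @ (batch, len, d)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 6, 8)) for _ in range(3))
uniform = np.full((6, 6), 1.0 / 6)               # illustrative constant choice
assert constant_attention(v, uniform).shape == input_dependent_attention(q, k, v).shape
```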
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.