Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation
- URL: http://arxiv.org/abs/2309.14174v1
- Date: Mon, 25 Sep 2023 14:33:47 GMT
- Title: Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation
- Authors: Zihan Liu, Zewei Sun, Shanbo Cheng, Shujian Huang, Mingxuan Wang
- Abstract summary: Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse phenomena by introducing document-level context information.
One of the most important directions is to input the whole document directly to the standard Transformer model.
In this work, we keep the translation performance while gaining a 20% speedup by introducing an extra selection layer based on lightweight attention that selects a small portion of tokens to be attended to.
- Score: 70.87670058323239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document-level Neural Machine Translation (DocNMT) has been proven crucial
for handling discourse phenomena by introducing document-level context
information. One of the most important directions is to input the whole
document directly to the standard Transformer model. In this case, efficiency
becomes a critical concern due to the quadratic complexity of the attention
module. Existing studies either focus on the encoder part, which cannot be
deployed on sequence-to-sequence generation tasks, e.g., Machine Translation
(MT), or suffer from a significant performance drop. In this work, we keep the
translation performance while gaining a 20% speedup by introducing an extra
selection layer based on lightweight attention that selects a small portion of
tokens to be attended to. It takes advantage of the original attention to ensure
performance and of dimension reduction to accelerate inference. Experimental
results show that our method achieves approximately 95% sparsity (only 5% of
tokens attended) and saves 93% of the computation cost on the attention module
compared with the original Transformer, while maintaining the performance.
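To make the selection idea concrete, below is a minimal PyTorch sketch of one way such a layer could work: a low-dimensional (dimension-reduced) scorer cheaply rates the source tokens, only the top ~5% are kept, and full attention then runs over the selected tokens only. This is an illustrative assumption of the general technique, not the authors' implementation; the module and parameter names (`LightweightSelectionAttention`, `d_select`, `keep_ratio`) are hypothetical.

```python
# Hypothetical sketch of a lightweight token-selection attention layer.
# Not the paper's exact architecture: the selection scorer, keep_ratio, and
# all names here are illustrative assumptions.
import torch
import torch.nn as nn


class LightweightSelectionAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int,
                 d_select: int = 32, keep_ratio: float = 0.05):
        super().__init__()
        # Low-dimensional projections give cheap selection scores.
        self.q_sel = nn.Linear(d_model, d_select)
        self.k_sel = nn.Linear(d_model, d_select)
        self.full_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, query: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # query: (batch, tgt_len, d_model); memory: (batch, src_len, d_model)
        # 1) Cheap relevance score per source token in the reduced dimension.
        scores = torch.einsum("btd,bsd->bs", self.q_sel(query), self.k_sel(memory))
        # 2) Keep only the top ~keep_ratio (e.g. ~5%) of source tokens.
        k = max(1, int(memory.size(1) * self.keep_ratio))
        top_idx = scores.topk(k, dim=-1).indices                     # (batch, k)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, memory.size(-1))  # (batch, k, d_model)
        selected = memory.gather(1, idx)                             # (batch, k, d_model)
        # 3) Full attention is computed only over the selected tokens.
        out, _ = self.full_attn(query, selected, selected)
        return out
```

Because the expensive attention sees only the selected keys and values, its cost scales with the kept fraction of the sequence rather than its full length, which is the source of the reported savings.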
Related papers
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- Input-length-shortening and text generation via attention values [1.8222946691865871]
We show that the first layer's attention sums can be used to filter tokens in a given sequence.
We also show that retaining approximately 6% of the original sequence is sufficient to obtain 86.5% accuracy.
arXiv Detail & Related papers (2023-03-14T02:11:24Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision Transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
arXiv Detail & Related papers (2020-09-30T13:18:57Z)
- Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation [28.94748226472447]
We study the pros and cons of the standard transformer in document-level translation.
We propose a surprisingly simple long-short term masking self-attention on top of the standard transformer.
We can achieve a strong result in BLEU and capture discourse phenomena.
arXiv Detail & Related papers (2020-09-19T00:29:51Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)