Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
- URL: http://arxiv.org/abs/2305.16340v3
- Date: Mon, 23 Oct 2023 01:44:58 GMT
- Title: Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
- Authors: Yinghan Long, Sayeed Shafayet Chowdhury, Kaushik Roy
- Abstract summary: We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves $6-22\%$ higher ROUGE1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
- Score: 10.473819332984005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have shown dominant performance across a range of domains
including language and vision. However, their computational cost grows
quadratically with the sequence length, making their usage prohibitive for
resource-constrained applications. To counter this, our approach is to divide
the whole sequence into segments and apply attention to the individual
segments. We propose a segmented recurrent transformer (SRformer) that combines
segmented (local) attention with recurrent attention. The loss caused by
reducing the attention window length is compensated for by aggregating information
across segments with recurrent attention. SRformer leverages Recurrent
Accumulate-and-Fire (RAF) neurons' inherent memory to update the cumulative
product of keys and values. The segmented attention and lightweight RAF neurons
ensure the efficiency of the proposed transformer. Such an approach leads to
models with sequential processing capability at a lower computation/memory
cost. We apply the proposed method to T5 and BART transformers. The modified
models are tested on summarization datasets including CNN-dailymail, XSUM,
ArXiv, and MediaSUM. Notably, using segmented inputs of varied sizes, the
proposed model achieves $6-22\%$ higher ROUGE1 scores than a segmented
transformer and outperforms other recurrent transformer approaches.
Furthermore, compared to full attention, the proposed model reduces the
computational complexity of cross attention by around $40\%$.
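To make the mechanism concrete, the following is a minimal NumPy sketch of segmented (local) attention combined with a recurrent accumulation of key-value products. It is an illustration under stated assumptions, not the authors' implementation: the decay-gated state update stands in for the paper's Recurrent Accumulate-and-Fire (RAF) neurons, and all function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def segmented_recurrent_attention(Q, K, V, seg_len, decay=0.9):
    """Segmented (local) attention plus a recurrent summary of past segments.

    Q, K, V: (seq_len, d) arrays. Attention is computed only within each
    segment of length seg_len; information from earlier segments enters
    through a running accumulation of K^T V (a simplified stand-in for the
    paper's RAF neurons -- the decay gate here is an assumption).
    """
    seq_len, d = Q.shape
    out = np.zeros_like(V)
    kv_state = np.zeros((d, V.shape[1]))  # accumulated K^T V across segments

    for start in range(0, seq_len, seg_len):
        end = min(start + seg_len, seq_len)
        q, k, v = Q[start:end], K[start:end], V[start:end]

        # Local attention restricted to the current segment.
        local = softmax(q @ k.T / np.sqrt(d)) @ v

        # Recurrent attention against the accumulated key-value state.
        recurrent = q @ kv_state / np.sqrt(d)

        out[start:end] = local + recurrent

        # Update the cumulative product of keys and values for later segments.
        kv_state = decay * kv_state + k.T @ v

    return out

# Toy usage: a length-16 sequence split into segments of 4 tokens.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(segmented_recurrent_attention(Q, K, V, seg_len=4).shape)  # (16, 8)
```

With a fixed segment length, the local term costs on the order of $\text{seg\_len}^2 \cdot d$ per segment and the recurrent term only $\text{seg\_len} \cdot d^2$, so the total cost grows linearly with sequence length rather than quadratically; savings of this kind are behind the reported reduction in cross-attention complexity.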
Related papers
- ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers [0.0]
Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection.
We propose to cluster the transformer input on the basis of its entropy.
Clustering reduces the size of data given as input to the transformer and therefore reduces training time and GPU memory usage.
arXiv Detail & Related papers (2024-09-11T18:03:59Z)
- CAST: Clustering Self-Attention using Surrogate Tokens for Efficient Transformers [3.129187821625805]
We propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST), to optimize the attention and achieve efficient transformers.
CAST improves efficiency by reducing the complexity from $O(N^2)$ to $O(\alpha N)$, where $N$ is the sequence length and $\alpha$ is a constant determined by the number of clusters and the number of samples per cluster.
arXiv Detail & Related papers (2024-02-06T18:47:52Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full attention by analyzing the graph expander property from a spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation [6.303594714446706]
The self-attention mechanism gauges pairwise correlations across the entire input sequence.
Despite its favorable performance, calculating these pairwise correlations is prohibitively costly.
This work addresses these constraints by architecting an accelerator, called SPRINT, which computes attention scores in an approximate manner.
arXiv Detail & Related papers (2022-09-01T17:18:19Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we employ transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation [58.4650849317274]
Volumetric Aggregation with Transformers (VAT) is a cost aggregation network for few-shot segmentation.
VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
arXiv Detail & Related papers (2022-07-22T04:10:30Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computational complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart (a generic sketch of the underlying linear-attention recurrence follows this list).
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
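As general background for the conversion described in the last entry (and for recurrent-attention approaches such as the SRformer above), here is a minimal sketch of causal linear attention evaluated as a recurrence: replacing the softmax with a feature map lets the key-value products be accumulated in a fixed-size state, so generation runs like an RNN. The feature map and all names below are illustrative assumptions, not the specific method of that paper.

```python
import numpy as np

def elu_plus_one(x):
    # A simple positive feature map, commonly used in kernelized linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_rnn(Q, K, V):
    """Causal linear attention evaluated as a recurrence, O(seq_len * d^2)."""
    seq_len, d = Q.shape
    S = np.zeros((d, V.shape[1]))  # running sum of phi(k) v^T
    z = np.zeros(d)                # running sum of phi(k), for normalization
    out = np.zeros_like(V)
    for t in range(seq_len):
        phi_q, phi_k = elu_plus_one(Q[t]), elu_plus_one(K[t])
        S += np.outer(phi_k, V[t])
        z += phi_k
        out[t] = (phi_q @ S) / (phi_q @ z + 1e-6)  # attention output at step t
    return out

# Toy usage
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((12, 8)) for _ in range(3))
print(linear_attention_rnn(Q, K, V).shape)  # (12, 8)
```

Because the state S has a fixed size independent of sequence length, each generation step costs the same amount of work, which is what makes such recurrent variants attractive for autoregressive decoding.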
This list is automatically generated from the titles and abstracts of the papers on this site.