Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
- URL: http://arxiv.org/abs/2305.16340v3
- Date: Mon, 23 Oct 2023 01:44:58 GMT
- Title: Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model
- Authors: Yinghan Long, Sayeed Shafayet Chowdhury, Kaushik Roy
- Abstract summary: We propose a segmented recurrent transformer (SRformer) that combines segmented (local) attention with recurrent attention.
The proposed model achieves $6-22\%$ higher ROUGE1 scores than a segmented transformer and outperforms other recurrent transformer approaches.
- Score: 10.473819332984005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have shown dominant performance across a range of domains
including language and vision. However, their computational cost grows
quadratically with the sequence length, making their usage prohibitive for
resource-constrained applications. To counter this, our approach is to divide
the whole sequence into segments and apply attention to the individual
segments. We propose a segmented recurrent transformer (SRformer) that combines
segmented (local) attention with recurrent attention. The loss caused by
reducing the attention window length is compensated for by aggregating information
across segments with recurrent attention. SRformer leverages Recurrent
Accumulate-and-Fire (RAF) neurons' inherent memory to update the cumulative
product of keys and values. The segmented attention and lightweight RAF neurons
ensure the efficiency of the proposed transformer. Such an approach leads to
models with sequential processing capability at a lower computation/memory
cost. We apply the proposed method to T5 and BART transformers. The modified
models are tested on summarization datasets including CNN-dailymail, XSUM,
ArXiv, and MediaSUM. Notably, using segmented inputs of varied sizes, the
proposed model achieves $6-22\%$ higher ROUGE1 scores than a segmented
transformer and outperforms other recurrent transformer approaches.
Furthermore, compared to full attention, the proposed model reduces the
computational complexity of cross attention by around $40\%$.
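To make the mechanism concrete, the following is a minimal NumPy sketch of segmented (local) attention combined with a recurrent accumulation of key-value products. It is an illustration under stated assumptions, not the authors' implementation: the decay-gated state update stands in for the paper's Recurrent Accumulate-and-Fire (RAF) neurons, and all function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def segmented_recurrent_attention(Q, K, V, seg_len, decay=0.9):
    """Segmented (local) attention plus a recurrent summary of past segments.

    Q, K, V: (seq_len, d) arrays. Attention is computed only within each
    segment of length seg_len; information from earlier segments enters
    through a running accumulation of K^T V (a simplified stand-in for the
    paper's RAF neurons -- the decay gate here is an assumption).
    """
    seq_len, d = Q.shape
    out = np.zeros_like(V)
    kv_state = np.zeros((d, V.shape[1]))  # accumulated K^T V across segments

    for start in range(0, seq_len, seg_len):
        end = min(start + seg_len, seq_len)
        q, k, v = Q[start:end], K[start:end], V[start:end]

        # Local attention restricted to the current segment.
        local = softmax(q @ k.T / np.sqrt(d)) @ v

        # Recurrent attention against the accumulated key-value state.
        recurrent = q @ kv_state / np.sqrt(d)

        out[start:end] = local + recurrent

        # Update the cumulative product of keys and values for later segments.
        kv_state = decay * kv_state + k.T @ v

    return out

# Toy usage: a length-16 sequence split into segments of 4 tokens.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(segmented_recurrent_attention(Q, K, V, seg_len=4).shape)  # (16, 8)
```

With a fixed segment length, the local term costs on the order of $\text{seg\_len}^2 \cdot d$ per segment and the recurrent term only $\text{seg\_len} \cdot d^2$, so the total cost grows linearly with sequence length rather than quadratically; savings of this kind are behind the reported reduction in cross-attention complexity.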
Related papers
- ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers [0.0]
Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection.
We propose to cluster the transformer input on the basis of its entropy.
Clustering reduces the size of data given as input to the transformer and therefore reduces training time and GPU memory usage.
arXiv Detail & Related papers (2024-09-11T18:03:59Z)
- CAST: Clustering Self-Attention using Surrogate Tokens for Efficient Transformers [3.129187821625805]
We propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST), to optimize the attention and achieve efficient transformers.
CAST improves efficiency by reducing the complexity from $O(N^2)$ to $O(\alpha N)$, where $N$ is the sequence length and $\alpha$ is a constant determined by the number of clusters and the number of samples per cluster.
arXiv Detail & Related papers (2024-02-06T18:47:52Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full attention by analyzing the graph expander property from a spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation [6.303594714446706]
The self-attention mechanism gauges pairwise correlations across the entire input sequence.
Despite its favorable performance, calculating these pairwise correlations is prohibitively costly.
This work addresses these constraints by architecting an accelerator, called SPRINT, which computes attention scores in an approximate manner.
arXiv Detail & Related papers (2022-09-01T17:18:19Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we employ transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation [58.4650849317274]
Volumetric Aggregation with Transformers (VAT) is a cost aggregation network for few-shot segmentation.
VAT attains state-of-the-art performance for semantic correspondence as well, where cost aggregation also plays a central role.
arXiv Detail & Related papers (2022-07-22T04:10:30Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computational complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart (a generic sketch of the underlying linear-attention recurrence follows this list).
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
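As general background for the conversion described in the last entry (and for recurrent-attention approaches such as the SRformer above), here is a minimal sketch of causal linear attention evaluated as a recurrence: replacing the softmax with a feature map lets the key-value products be accumulated in a fixed-size state, so generation runs like an RNN. The feature map and all names below are illustrative assumptions, not the specific method of that paper.

```python
import numpy as np

def elu_plus_one(x):
    # A simple positive feature map, commonly used in kernelized linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_rnn(Q, K, V):
    """Causal linear attention evaluated as a recurrence, O(seq_len * d^2)."""
    seq_len, d = Q.shape
    S = np.zeros((d, V.shape[1]))  # running sum of phi(k) v^T
    z = np.zeros(d)                # running sum of phi(k), for normalization
    out = np.zeros_like(V)
    for t in range(seq_len):
        phi_q, phi_k = elu_plus_one(Q[t]), elu_plus_one(K[t])
        S += np.outer(phi_k, V[t])
        z += phi_k
        out[t] = (phi_q @ S) / (phi_q @ z + 1e-6)  # attention output at step t
    return out

# Toy usage
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((12, 8)) for _ in range(3))
print(linear_attention_rnn(Q, K, V).shape)  # (12, 8)
```

Because the state S has a fixed size independent of sequence length, each generation step costs the same amount of work, which is what makes such recurrent variants attractive for autoregressive decoding.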
This list is automatically generated from the titles and abstracts of the papers on this site.