LSG Attention: Extrapolation of pretrained Transformers to long
sequences
- URL: http://arxiv.org/abs/2210.15497v1
- Date: Thu, 13 Oct 2022 13:10:41 GMT
- Title: LSG Attention: Extrapolation of pretrained Transformers to long
sequences
- Authors: Charles Condevaux and Sébastien Harispe
- Abstract summary: We introduce the LSG architecture which relies on Local, Sparse and Global attention.
We show that LSG attention is fast, efficient and competitive in classification and summarization tasks on long documents.
We propose tools to train new models and adapt existing ones based on this mechanism.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models achieve state-of-the-art performance on a wide range of
NLP tasks. However, they suffer from a prohibitive limitation of the
self-attention mechanism, which induces $O(n^2)$ complexity with respect to
sequence length. To address this limitation, we introduce the LSG architecture,
which relies on Local, Sparse and Global attention. We show that LSG attention is
fast, efficient and competitive in classification and summarization tasks on
long documents. Interestingly, it can also be used to adapt existing pretrained
models to efficiently extrapolate to longer sequences with no additional
training. Along with the introduction of the LSG attention mechanism, we
propose tools to train new models and adapt existing ones based on this
mechanism.
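As a concrete illustration of the Local, Sparse and Global pattern, the sketch below builds a combined attention mask and applies it in standard scaled dot-product attention. The window size, the strided sparse pattern and the number of global tokens are illustrative assumptions, not the authors' implementation; the released LSG tools realize this pattern without materializing a dense mask.

```python
# Minimal sketch of an LSG-style attention mask combining Local, Sparse and
# Global attention. The window size, the strided sparse pattern and the number
# of global tokens are illustrative assumptions; an actual implementation would
# avoid building a dense n x n mask, which is itself O(n^2) memory.
import torch
import torch.nn.functional as F


def lsg_mask(n_tokens: int, n_global: int = 4, window: int = 128, stride: int = 64) -> torch.Tensor:
    """Boolean (n, n) mask, True where attention is allowed.

    The first `n_global` positions act as global tokens prepended to the sequence.
    """
    n = n_global + n_tokens
    idx = torch.arange(n)
    local = (idx[:, None] - idx[None, :]).abs() <= window          # tokens within a window
    sparse = (idx[None, :] % stride) == 0                          # a strided subset of columns
    glob = (idx[:, None] < n_global) | (idx[None, :] < n_global)   # global rows and columns
    return local | sparse | glob


def masked_attention(q, k, v, mask):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


# Toy usage: 4 global tokens prepended to a 1024-token sequence, head dim 64.
q = k = v = torch.randn(1, 4 + 1024, 64)
out = masked_attention(q, k, v, lsg_mask(n_tokens=1024))
print(out.shape)  # torch.Size([1, 1028, 64])
```

An efficient implementation computes only the score blocks allowed by such a mask, which brings the cost from $O(n^2)$ down to roughly linear in sequence length for a fixed window, stride and number of global tokens.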
Related papers
- Local Attention Mechanism: Boosting the Transformer Architecture for Long-Sequence Time Series Forecasting [8.841114905151152]
Local Attention Mechanism (LAM) is an efficient attention mechanism tailored for time series analysis.
LAM exploits the continuity properties of time series to reduce the number of attention scores computed.
We present an algorithm for implementing LAM in tensor algebra that runs in O(n log n) time and memory.
arXiv Detail & Related papers (2024-10-04T11:32:02Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query (a minimal sketch of this selection step appears after this list).
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to compensate for the lack of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - On the Long Range Abilities of Transformers [69.3021852589771]
We demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena benchmark.
We identify two key principles for long-range tasks: (i) incorporating an inductive bias towards smoothness, and (ii) locality.
As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters.
arXiv Detail & Related papers (2023-11-28T09:21:48Z) - Efficient Long-Range Transformers: You Need to Attend More, but Not
Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model with 1.3B parameters achieves performance competitive with vanilla transformers using full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z) - Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r.
We augment the bottom layer of SPADE with a state space model (SSM), and we employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z) - Mega: Moving Average Equipped Gated Attention [150.3124713793503]
Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with an (exponential) moving average.
We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
arXiv Detail & Related papers (2022-09-21T20:52:17Z) - Long Range Language Modeling via Gated State Spaces [67.64091993846269]
We focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles.
We propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4.
arXiv Detail & Related papers (2022-06-27T01:50:18Z)
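As referenced in the SPARSEK entry above, the sketch below illustrates the idea of keeping only a constant number of key/value pairs chosen by a learned scorer. The plain linear scorer, the hard (non-differentiable) top-k and the single shared selection per sequence are simplifying assumptions; this is not the SPARSEK implementation, which uses a differentiable per-query top-k mask.

```python
# Hard top-k key/value selection, loosely illustrating the SPARSEK idea of a
# scoring network that keeps a constant number of KV pairs. Assumptions: a
# plain linear scorer, hard (non-differentiable) top-k, and one shared
# selection per sequence instead of a differentiable per-query mask.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKAttention(nn.Module):
    def __init__(self, dim: int, k: int = 64):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # assumed scoring network
        self.k = k

    def forward(self, query, key, value):
        # query, key, value: (batch, seq, dim)
        scores = self.scorer(key).squeeze(-1)                  # (batch, seq)
        k = min(self.k, key.size(1))
        idx = scores.topk(k, dim=-1).indices                   # (batch, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, key.size(-1))   # (batch, k, dim)
        key_sel = key.gather(1, idx)                           # keep only the top-k keys
        value_sel = value.gather(1, idx)
        attn = query @ key_sel.transpose(-2, -1) / query.size(-1) ** 0.5
        return F.softmax(attn, dim=-1) @ value_sel             # (batch, seq, dim)


# Toy usage: each of 2048 queries attends to only 64 selected KV pairs.
x = torch.randn(2, 2048, 64)
print(TopKAttention(dim=64)(x, x, x).shape)  # torch.Size([2, 2048, 64])
```

Because every query attends to at most k selected pairs, the attention cost scales as $O(nk)$ rather than $O(n^2)$.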