Mega: Moving Average Equipped Gated Attention
- URL: http://arxiv.org/abs/2209.10655v1
- Date: Wed, 21 Sep 2022 20:52:17 GMT
- Title: Mega: Moving Average Equipped Gated Attention
- Authors: Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham
Neubig, Jonathan May, Luke Zettlemoyer
- Abstract summary: Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average.
We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
- Score: 150.3124713793503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The design choices in the Transformer attention mechanism, including weak
inductive bias and quadratic computational complexity, have limited its
application for modeling long sequences. In this paper, we introduce Mega, a
simple, theoretically grounded, single-head gated attention mechanism equipped
with (exponential) moving average to incorporate inductive bias of
position-aware local dependencies into the position-agnostic attention
mechanism. We further propose a variant of Mega that offers linear time and
space complexity yet yields only minimal quality loss, by efficiently splitting
the whole sequence into multiple chunks with fixed length. Extensive
experiments on a wide range of sequence modeling benchmarks, including the Long
Range Arena, neural machine translation, auto-regressive language modeling, and
image and speech classification, show that Mega achieves significant
improvements over other sequence models, including variants of Transformers and
recent state space models.
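The abstract describes two components: a damped exponential moving average (EMA) that injects position-aware local inductive bias into the inputs of a single-head gated attention unit, and a chunk-based variant that restricts attention to fixed-length chunks for linear time and space complexity. The sketch below illustrates both ideas in simplified form; it is not the authors' released implementation, and the per-dimension EMA parameterization, the single sigmoid gate, and all function and parameter names are illustrative assumptions.

```python
# A minimal sketch, NOT the authors' implementation: it illustrates (1) a
# per-dimension damped exponential moving average (EMA) that injects
# position-aware local bias into the inputs of a single attention head, and
# (2) chunked attention that restricts the quadratic computation to
# fixed-length chunks. Function and parameter names are illustrative.
import torch


def damped_ema(x, alpha, delta):
    """Damped EMA per feature dimension: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}.

    x: (batch, length, dim); alpha, delta: (dim,) values in (0, 1).
    Written as an explicit recurrence for clarity; in practice it can be
    computed in parallel as a convolution.
    """
    batch, length, dim = x.shape
    y = x.new_zeros(batch, dim)
    decay = 1.0 - alpha * delta
    outputs = []
    for t in range(length):
        y = alpha * x[:, t] + decay * y
        outputs.append(y)
    return torch.stack(outputs, dim=1)


def chunked_single_head_attention(q, k, v, chunk):
    """Softmax attention restricted to non-overlapping chunks of fixed length.

    Cost is O(length * chunk) instead of O(length^2); length must be a
    multiple of `chunk` in this simplified sketch.
    """
    b, n, d = q.shape
    q = q.view(b, n // chunk, chunk, d)
    k = k.view(b, n // chunk, chunk, d)
    v = v.view(b, n // chunk, chunk, d)
    scores = torch.einsum("bcqd,bckd->bcqk", q, k) / d ** 0.5
    out = torch.einsum("bcqk,bckd->bcqd", scores.softmax(dim=-1), v)
    return out.reshape(b, n, d)


def mega_like_layer(x, w_q, w_k, w_v, w_g, alpha, delta, chunk=128):
    """EMA-smoothed inputs drive a single attention head; a sigmoid gate
    (simplified here to one gate) mixes the attention output with the input."""
    x_ema = damped_ema(x, alpha, delta)        # position-aware local context
    q, k = x_ema @ w_q, x_ema @ w_k            # queries/keys from the EMA stream
    v = x @ w_v                                # values from the raw inputs
    attn = chunked_single_head_attention(q, k, v, chunk)
    gate = torch.sigmoid(x_ema @ w_g)          # output gate
    return gate * attn + (1.0 - gate) * x      # gated residual update


# Example shapes: a batch of 2 sequences of length 256 with 64 features.
x = torch.randn(2, 256, 64)
w_q, w_k, w_v, w_g = (0.02 * torch.randn(64, 64) for _ in range(4))
alpha, delta = torch.rand(64), torch.rand(64)
y = mega_like_layer(x, w_q, w_k, w_v, w_g, alpha, delta, chunk=64)  # (2, 256, 64)
```

In the chunked variant, the EMA recurrence still carries information across chunk boundaries, which is one intuition for why the abstract reports only minimal quality loss from chunking.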
Related papers
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The computational cost of the self-attention mechanism limits its practicality for long sequences.
We propose a new method called LongVQ that uses vector quantization to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to address the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers [78.85346970193518]
Megabyte is a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes.
Experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling.
Results establish the viability of tokenization-free autoregressive sequence modeling at scale.
arXiv Detail & Related papers (2023-05-12T00:55:41Z)
- LSG Attention: Extrapolation of pretrained Transformers to long sequences [0.0]
We introduce the LSG architecture which relies on Local, Sparse and Global attention.
We show that LSG attention is fast, efficient and competitive in classification and summarization tasks on long documents.
We propose tools to train new models and adapt existing ones based on this mechanism.
arXiv Detail & Related papers (2022-10-13T13:10:41Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA for short).
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)
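The Performer entry above describes attention whose cost scales linearly through a random-feature approximation of the softmax kernel. Below is a minimal sketch of that idea in the spirit of FAVOR; it uses a standard positive random-feature map and a non-causal formulation, and it is not the Performer reference implementation (all names are illustrative).

```python
# A minimal sketch of linear-complexity attention via random features, in the
# spirit of FAVOR as summarized above. It uses a standard positive
# random-feature approximation of the softmax kernel and a non-causal
# formulation; it is NOT the Performer reference implementation.
import torch


def softmax_kernel_features(x, projection):
    """phi(x) with E[phi(q) @ phi(k)] approximating exp(q @ k / sqrt(dim))."""
    x = x * x.shape[-1] ** -0.25                    # fold in the 1/sqrt(dim) scaling
    proj = x @ projection.T                         # (length, num_features)
    sq_norm = 0.5 * (x * x).sum(dim=-1, keepdim=True)
    return torch.exp(proj - sq_norm) / projection.shape[0] ** 0.5


def linear_attention(q, k, v, num_features=256):
    """O(length) attention: the (length x length) matrix is never materialized."""
    projection = torch.randn(num_features, q.shape[-1])        # rows ~ N(0, I)
    q_prime = softmax_kernel_features(q, projection)           # (n, m)
    k_prime = softmax_kernel_features(k, projection)           # (n, m)
    kv = k_prime.T @ v                                          # (m, d) summary of keys/values
    normalizer = q_prime @ k_prime.sum(dim=0, keepdim=True).T   # (n, 1)
    return (q_prime @ kv) / (normalizer + 1e-6)


# Example: a sequence of 1024 tokens with 64-dimensional features.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
out = linear_attention(q, k, v)                                 # (1024, 64)
```

Because keys and values are summarized once into an (m x d) matrix before the queries are applied, time and memory scale linearly in sequence length, matching the claim in the summary above.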
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.