Random Feature Attention
- URL: http://arxiv.org/abs/2103.02143v1
- Date: Wed, 3 Mar 2021 02:48:56 GMT
- Title: Random Feature Attention
- Authors: Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith,
Lingpeng Kong
- Abstract summary: We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function.
RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism.
Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines.
- Score: 69.4671822971207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers are state-of-the-art models for a variety of sequence modeling
tasks. At their core is an attention function which models pairwise
interactions between the inputs at every timestep. While attention is powerful,
it does not scale efficiently to long sequences due to its quadratic time and
space complexity in the sequence length. We propose RFA, a linear time and
space attention that uses random feature methods to approximate the softmax
function, and explore its application in transformers. RFA can be used as a
drop-in replacement for conventional softmax attention and offers a
straightforward way of learning with recency bias through an optional gating
mechanism. Experiments on language modeling and machine translation demonstrate
that RFA achieves similar or better performance compared to strong transformer
baselines. In the machine translation experiment, RFA decodes twice as fast as
a vanilla transformer. Compared to existing efficient transformer variants, RFA
is competitive in terms of both accuracy and efficiency on three long text
classification datasets. Our analysis shows that RFA's efficiency gains are
especially notable on long sequences, suggesting that RFA will be particularly
useful in tasks that require working with large inputs, fast decoding speed, or
low memory footprints.
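The core approximation can be sketched in a few lines: with trigonometric random features, phi(q)·phi(k) is an unbiased estimate of a Gaussian kernel, which is proportional to exp(q·k) once queries and keys are l2-normalized, so attention reduces to two matrix products that are linear in sequence length. The NumPy sketch below is illustrative only (non-causal, no gating; the feature count and the comparison harness are assumptions, not the authors' code):

```python
import numpy as np

def random_feature_map(x, W):
    """Trigonometric random features: phi(x) . phi(y) estimates the
    Gaussian kernel exp(-||x - y||^2 / 2)."""
    proj = x @ W.T                                    # (..., D)
    D = W.shape[0]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(D)

def rfa_attention(Q, K, V, num_features=512, seed=0):
    """Linear-time, linear-space approximation of softmax attention
    (non-causal sketch).  Q and K are assumed l2-normalized, so exp(q.k)
    is proportional to the Gaussian kernel and the constant cancels in
    the normalizer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_features, Q.shape[-1]))  # w_i ~ N(0, I)
    phi_q = random_feature_map(Q, W)                  # (n, 2D)
    phi_k = random_feature_map(K, W)                  # (n, 2D)
    S = phi_k.T @ V                                   # (2D, d_v): sum_i phi(k_i) v_i^T
    z = phi_k.sum(axis=0)                             # (2D,):     sum_i phi(k_i)
    return (phi_q @ S) / (phi_q @ z)[:, None]         # never forms the n x n matrix

# Sanity check against exact softmax attention on unit-norm inputs.
rng = np.random.default_rng(1)
n, d = 128, 64
Q = rng.normal(size=(n, d)); Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
K = rng.normal(size=(n, d)); K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.normal(size=(n, d))
A = np.exp(Q @ K.T)
exact = (A / A.sum(-1, keepdims=True)) @ V
print(np.abs(exact - rfa_attention(Q, K, V)).mean())
```

The causal, gated variant described in the paper keeps a running feature-space state rather than materializing phi_k.T @ V over the whole sequence, which is where the recency-bias gating enters.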
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
Self-attention mechanism in Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose FIRE, a novel functional relative position encoding with progressive interpolation, to improve Transformer generalization to longer contexts.
We theoretically prove that FIRE can represent some popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models generalize better to longer contexts on both zero-shot language modeling and long-text benchmarks; a generic sketch of a functional position bias follows this entry.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
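To make "functional relative position encoding" concrete: instead of a fixed bias per relative offset, a small learned function of a transformed relative distance produces the additive attention bias. The sketch below is a generic stand-in rather than FIRE's exact formulation (the MLP, the log scaling, and the query-position normalization are illustrative assumptions):

```python
import numpy as np

def fire_like_bias(n, mlp, eps=1.0):
    """A generic functional relative-position bias: the attention bias for
    (i, j) is a small learned function of a log-scaled, normalized relative
    distance.  FIRE's exact transform and normalizer are in the paper; this
    is an illustrative stand-in."""
    psi = lambda x: np.log1p(np.maximum(x, 0.0))       # log-scale distances
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    rel = psi(i - j) / (psi(np.maximum(i, eps)) + 1e-9)  # normalize by query position
    return mlp(rel[..., None]).squeeze(-1)               # (n, n) additive bias

# A toy two-layer MLP standing in for the learned function f_theta.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(1, 32)), rng.normal(size=(32, 1))
mlp = lambda x: np.maximum(x @ W1, 0.0) @ W2
bias = fire_like_bias(128, mlp)
print(bias.shape)  # (128, 128): added to the attention logits before softmax
```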
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce a class of efficient Transformers named Regularized Transformers (Reguformers).
Our experiments focus on oil and gas data, namely well logs.
To evaluate our models on such problems, we work with an industry-scale open dataset consisting of well logs from more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full attention by analyzing the graph expander property from a spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z)
- FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks [0.2538209532048867]
We introduce FastRPB, which efficiently adds positional information to self-attention.
FastRPB has O(N log(N)) computational complexity, requiring O(N) memory w.r.t. input sequence length N.
arXiv Detail & Related papers (2022-02-23T09:12:00Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based on the observation that relative positional encoding forms a Toeplitz matrix, we show mathematically that kernelized attention with RPE can be computed efficiently using the Fast Fourier Transform (FFT); a minimal sketch of the Toeplitz-times-vector trick follows this entry.
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
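The Toeplitz observation is easy to see in isolation: a relative-position bias B[i, j] = b[i - j] depends only on the offset, so B is a Toeplitz matrix and B @ v can be computed in O(N log N) by embedding B in a circulant matrix and applying the FFT. A minimal NumPy sketch (the bias values and shapes are illustrative; the paper applies this inside kernelized attention):

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply the Toeplitz matrix T (first column c, first row r) by x
    in O(N log N) via circulant embedding and the FFT."""
    n = len(x)
    # Circulant embedding: first column [c_0..c_{n-1}, 0, r_{n-1}..r_1].
    col = np.concatenate([c, [0.0], r[1:][::-1]])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
    return y[:n].real

# A relative-position bias B[i, j] = b[i - j] is exactly such a Toeplitz
# matrix, so B @ v costs O(N log N) instead of O(N^2).
n = 512
b = np.random.default_rng(0).normal(size=2 * n - 1)  # biases for offsets -(n-1)..(n-1)
c = b[n - 1:]       # first column: offsets 0, 1, ..., n-1
r = b[n - 1::-1]    # first row:    offsets 0, -1, ..., -(n-1)
x = np.random.default_rng(1).normal(size=n)
T = np.array([[b[(i - j) + n - 1] for j in range(n)] for i in range(n)])
print(np.allclose(T @ x, toeplitz_matvec_fft(c, r, x)))  # True
```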
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart; a minimal sketch of the underlying recurrence follows this entry.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
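The conversion rests on the fact that attention with a feature map in place of the softmax admits an exact recurrent form: a fixed-size state accumulates phi(k_t) v_t^T and phi(k_t), so each generated token costs O(1) time and memory. The sketch below uses a simple elu+1 feature map as a stand-in for the paper's learned one (names and sizes are illustrative):

```python
import numpy as np

def elu_feature_map(x):
    """A simple positive feature map (as in linear attention); the
    finetuning approach learns a small feature map instead."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def recurrent_decode_step(state, z, q, k, v):
    """One autoregressive step of linearized attention.

    state: (d, d_v) running sum of phi(k_t) v_t^T
    z:     (d,)     running sum of phi(k_t)
    Constant time and memory per generated token, unlike softmax
    attention, whose key/value cache grows with the sequence.
    """
    phi_q, phi_k = elu_feature_map(q), elu_feature_map(k)
    state = state + np.outer(phi_k, v)
    z = z + phi_k
    out = (phi_q @ state) / (phi_q @ z)
    return state, z, out

# Usage: decode a toy sequence token by token with O(1) state.
d, d_v, n = 16, 16, 10
rng = np.random.default_rng(0)
state, z = np.zeros((d, d_v)), np.zeros(d)
for _ in range(n):
    q, k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d_v)
    state, z, out = recurrent_decode_step(state, z, q, k, v)
print(out.shape)  # (16,)
```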
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity.
Recent studies have shown the potential of Transformers to increase prediction capacity.
We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.