Relative Positional Encoding for Transformers with Linear Complexity
- URL: http://arxiv.org/abs/2105.08399v1
- Date: Tue, 18 May 2021 09:52:32 GMT
- Title: Relative Positional Encoding for Transformers with Linear Complexity
- Authors: Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang, Gaël Richard
- Abstract summary: Relative positional encoding (RPE) was proposed as beneficial for classical Transformers.
RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix.
In this paper, we present Stochastic Positional Encoding (SPE) as a way to generate PE that can be used as a replacement to the classical additive (sinusoidal) PE and provably behaves like RPE.
- Score: 30.48367640796256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Transformer models allow for unprecedented sequence
lengths, due to linear space and time complexity. In the meantime, relative
positional encoding (RPE) was proposed as beneficial for classical Transformers
and consists in exploiting lags instead of absolute positions for inference.
Still, RPE is not available for the recent linear variants of the Transformer,
because it requires the explicit computation of the attention matrix, which is
precisely what is avoided by such methods. In this paper, we bridge this gap
and present Stochastic Positional Encoding as a way to generate PE that can be
used as a replacement to the classical additive (sinusoidal) PE and provably
behaves like RPE. The main theoretical contribution is to make a connection
between positional encoding and cross-covariance structures of correlated
Gaussian processes. We illustrate the performance of our approach on the
Long-Range Arena benchmark and on music generation.
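As a concrete (and heavily simplified) illustration of that connection, the sketch below draws positional query/key features as correlated Gaussian processes: queries read a shared white-noise sequence directly, keys read it through a filter, and their empirical cross-covariance reproduces a chosen relative (Toeplitz) kernel, while each feature is still produced position by position, as linear attention requires. The names qbar/kbar and the Gaussian-shaped kernel are illustrative assumptions, not the paper's actual SPE construction.

```python
# Hedged sketch: positional features as correlated Gaussian processes whose
# cross-covariance depends only on the lag m - n (i.e. behaves like RPE).
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
N, R = 64, 50_000                      # sequence length, number of random draws
lags = np.arange(-4, 5)                # support of the target relative kernel
k = np.exp(-0.5 * (lags / 2.0) ** 2)   # example stationary kernel k(m - n)
P, c = len(k), len(k) // 2

# Shared white noise: queries read it directly, keys read it through the filter k.
z = rng.standard_normal((R, N + P))
qbar = z[:, c:c + N]                                    # (R, N)
kbar = sliding_window_view(z, P, axis=1)[:, :N, :] @ k  # (R, N)

# Empirical cross-covariance: entry (m, n) ~ k(m - n), a Toeplitz structure.
emp = qbar.T @ kbar / R
print(np.round(emp[10, 8:13], 2))   # roughly [0.61, 0.88, 1.00, 0.88, 0.61]
print(np.round(emp[30, 28:33], 2))  # same profile elsewhere: relative, not absolute
```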
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts (a schematic sketch is given after this entry).
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
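To make the idea behind the FIRE entry concrete, here is a minimal sketch of a generic functional relative position bias: a small learned function of the (log-scaled, length-normalized) signed distance i - j is added to the attention logits, so one set of parameters extends to longer contexts. The tiny MLP and the exact normalization below are illustrative assumptions, not FIRE's published parameterization.

```python
# Schematic of a *functional* relative position bias (assumed form, not FIRE's
# exact transform): a learned function of the normalized signed distance i - j
# is added to the attention logits before the softmax.
import numpy as np

rng = np.random.default_rng(0)
d_hidden = 32
W1, b1 = rng.normal(size=(1, d_hidden)), np.zeros(d_hidden)  # tiny illustrative MLP
W2, b2 = rng.normal(size=(d_hidden, 1)), np.zeros(1)

def relative_bias(n: int) -> np.ndarray:
    """Bias matrix B[i, j] = f(phi(i - j)) for an n-token sequence."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    rel = (i - j).astype(float)
    # Log-scaling keeps the input bounded as the context grows (FIRE's
    # progressive interpolation is a more refined normalization than this).
    phi = np.sign(rel) * np.log1p(np.abs(rel)) / np.log1p(n)
    h = np.tanh(phi[..., None] @ W1 + b1)        # (n, n, d_hidden)
    return (h @ W2 + b2)[..., 0]                 # (n, n)

# Usage: add the bias to attention logits for any sequence length.
logits = rng.normal(size=(8, 8)) + relative_bias(8)
```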
- Linearized Relative Positional Encoding [43.898057545832366]
Relative positional encoding is widely used in vanilla and linear transformers to represent positional information.
We put together a variety of existing linear relative positional encoding approaches under a canonical form.
We further propose a family of linear relative positional encoding algorithms via unitary transformation (see the sketch after this entry).
arXiv Detail & Related papers (2023-07-18T13:56:43Z)
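The "unitary transformation" family mentioned above can be illustrated with a rotary-style construction (a sketch of the general principle, not that paper's specific algorithms): a position-dependent rotation applied separately to queries and keys makes their inner product depend only on the lag, which is exactly what linear attention needs since it never materializes the attention matrix.

```python
# Rotary-style sketch of relative PE via a unitary (here orthogonal) transform:
# rotate queries and keys independently by position-dependent angles; the dot
# product then depends only on the relative offset n - m.
import numpy as np

def rotate(x: np.ndarray, pos: int, thetas: np.ndarray) -> np.ndarray:
    """Rotate consecutive coordinate pairs of x by angles pos * thetas."""
    x1, x2 = x[0::2], x[1::2]
    c, s = np.cos(pos * thetas), np.sin(pos * thetas)
    out = np.empty_like(x)
    out[0::2], out[1::2] = c * x1 - s * x2, s * x1 + c * x2
    return out

rng = np.random.default_rng(0)
d = 8
thetas = 1.0 / (10_000 ** (np.arange(d // 2) / (d // 2)))  # per-pair frequencies
q, k = rng.normal(size=d), rng.normal(size=d)

# Positions (3, 7) and (10, 14) give the same score because 7 - 3 == 14 - 10.
s1 = rotate(q, 3, thetas) @ rotate(k, 7, thetas)
s2 = rotate(q, 10, thetas) @ rotate(k, 14, thetas)
print(np.isclose(s1, s2))  # True
```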
- The Impact of Positional Encoding on Length Generalization in Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches.
Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
arXiv Detail & Related papers (2023-05-31T00:29:55Z)
- Application of Transformers for Nonlinear Channel Compensation in Optical Systems [0.23499129784547654]
We introduce a new nonlinear optical channel equalizer based on Transformers.
By leveraging parallel computation and attending directly to the memory across a sequence of symbols, we show that Transformers can be used effectively for nonlinear compensation.
arXiv Detail & Related papers (2023-04-25T19:48:54Z)
- Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers [71.32827362323205]
We propose a new class of linear Transformers called Learner-Transformers (Learners).
They incorporate a wide range of relative positional encoding mechanisms (RPEs).
These include regular RPE techniques applied for sequential data, as well as novel RPEs operating on geometric data embedded in higher-dimensional Euclidean spaces.
arXiv Detail & Related papers (2023-02-03T18:57:17Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); the Toeplitz-times-vector FFT trick is sketched after this entry.
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
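The entry above rests on a classical fact that is easy to check numerically: a Toeplitz matrix, one whose entries depend only on i - j as relative positional biases do, can be applied to a vector in O(n log n) by embedding it in a circulant matrix and using the FFT. The sketch below shows that trick in isolation; it is not the paper's kernelized-attention code, and the matrix values are arbitrary.

```python
# Toeplitz matrix-vector product via circulant embedding and FFT.
import numpy as np

rng = np.random.default_rng(0)
n = 6
col = rng.normal(size=n)     # T[i, 0]: values for non-negative lags i - j
row = rng.normal(size=n)     # T[0, j]: values for non-positive lags i - j
row[0] = col[0]
x = rng.normal(size=n)

# Dense reference: T[i, j] depends only on i - j.
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)]
              for i in range(n)])
ref = T @ x                  # O(n^2)

# FFT path: first column of the 2n x 2n circulant embedding of T.
c = np.concatenate([col, [0.0], row[1:][::-1]])
y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, 2 * n)).real[:n]   # O(n log n)
print(np.allclose(ref, y))   # True
```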