The Impact of Positional Encoding on Length Generalization in
Transformers
- URL: http://arxiv.org/abs/2305.19466v2
- Date: Mon, 6 Nov 2023 19:48:10 GMT
- Title: The Impact of Positional Encoding on Length Generalization in
Transformers
- Authors: Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy,
Payel Das, Siva Reddy
- Abstract summary: We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches.
Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
- Score: 50.48278691801413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Length generalization, the ability to generalize from small training context
sizes to larger ones, is a critical challenge in the development of
Transformer-based language models. Positional encoding (PE) has been identified
as a major factor influencing length generalization, but the exact impact of
different PE schemes on extrapolation in downstream tasks remains unclear. In
this paper, we conduct a systematic empirical study comparing the length
generalization performance of decoder-only Transformers with five different
position encoding approaches including Absolute Position Embedding (APE), T5's
Relative PE, ALiBi, and Rotary, in addition to Transformers without positional
encoding (NoPE). Our evaluation encompasses a battery of reasoning and
mathematical tasks. Our findings reveal that the most commonly used positional
encoding methods, such as ALiBi, Rotary, and APE, are not well suited for
length generalization in downstream tasks. More importantly, NoPE outperforms
other explicit positional encoding methods while requiring no additional
computation. We theoretically demonstrate that NoPE can represent both absolute
and relative PEs, but when trained with SGD, it mostly resembles T5's relative
PE attention patterns. Finally, we find that scratchpad is not always helpful
for length generalization, and its format highly impacts the model's
performance. Overall, our work suggests that explicit position embeddings are
not essential for decoder-only Transformers to generalize well to longer
sequences.
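For concreteness, the sketch below (a minimal NumPy toy, not the paper's code)
shows how three of the compared schemes enter a single causal self-attention
head: NoPE injects no position signal at all, ALiBi adds a distance-dependent
bias to the attention logits, and Rotary rotates query/key feature pairs by
position-dependent angles. The head size, ALiBi slope, and rotary base are
illustrative placeholders.

```python
# Minimal single-head causal attention with three positional-encoding options.
import numpy as np

def causal_softmax(scores):
    """Mask out future keys, then softmax over the key dimension."""
    T = scores.shape[-1]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def alibi_bias(T, slope=0.5):
    """ALiBi-style bias: a linear penalty growing with query-key distance
    (entries for future keys are irrelevant, since the causal mask removes them)."""
    i = np.arange(T)[:, None]   # query positions
    j = np.arange(T)[None, :]   # key positions
    return -slope * (i - j)

def rotary(x, base=10000.0):
    """Rotary embedding: rotate (first half, second half) feature pairs by a
    position-dependent angle."""
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = np.arange(T)[:, None] * freqs[None, :]     # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention(q, k, v, pe="nope"):
    """One causal attention head using the chosen positional-encoding scheme."""
    if pe == "rotary":
        q, k = rotary(q), rotary(k)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if pe == "alibi":
        scores = scores + alibi_bias(q.shape[0])
    return causal_softmax(scores) @ v

T, d = 8, 16
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, T, d))
for pe in ("nope", "alibi", "rotary"):
    print(pe, attention(q, k, v, pe).shape)   # (8, 16) in every case
```

In the NoPE variant the causal mask is the only explicit source of order
information, which is the setting the paper's theoretical analysis examines.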
Related papers
- Length Generalization of Causal Transformers without Position Encoding [59.802708262402824]
Generalizing to longer sentences is important for recent Transformer-based language models.
We study the length generalization property of Transformers without position encodings.
We find that although NoPE can extend to sequences longer than the commonly used explicit position encodings, it still has a limited context length.
arXiv Detail & Related papers (2024-04-18T14:38:32Z) - Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z) - Functional Interpolation for Relative Positions Improves Long Context
Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that FIRE can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
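A rough, hedged sketch of the idea described above (not the authors'
implementation): the relative distance i - j is log-transformed, normalized by
the transformed query position so that the input stays bounded as the context
grows, and then mapped to an additive attention bias by a small learned
function. The toy MLP, the threshold L, and the constants below are
illustrative placeholders.

```python
# FIRE-style bias sketch: learned function of a normalized, log-compressed
# relative distance.
import numpy as np

def psi(x, c=1.0):
    """Monotone log transform that compresses large distances
    (negative offsets, i.e. future keys, are clamped; a decoder masks them anyway)."""
    return np.log1p(c * np.maximum(x, 0.0))

def fire_bias(T, mlp, L=32.0):
    i = np.arange(T, dtype=float)[:, None]      # query positions
    j = np.arange(T, dtype=float)[None, :]      # key positions
    z = psi(i - j) / psi(np.maximum(i, L))      # progressive interpolation to a bounded range
    return mlp(z[..., None]).squeeze(-1)        # (T, T) additive attention bias

# Toy stand-in for the learned function f_theta: one hidden layer, fixed weights.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(1, 16)), rng.normal(size=(16, 1))
mlp = lambda z: np.maximum(z @ W1, 0.0) @ W2

bias = fire_bias(T=8, mlp=mlp)
print(bias.shape)   # (8, 8); added to the attention logits before the causal mask
```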
arXiv Detail & Related papers (2023-10-06T17:59:11Z) - Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
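As a hedged illustration of what such a module can look like (our reading of
the published URPE formulation, not the authors' code): the usual softmax
attention weights are modulated element-wise by a second, learnable matrix
whose entries depend only on the relative offset i - j, i.e. a Toeplitz
matrix. Shapes and the bidirectional (unmasked) setting below are
illustrative.

```python
# URPE-style attention sketch: element-wise modulation of attention weights
# by a learnable Toeplitz matrix.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def urpe_attention(q, k, v, c):
    """c holds one learnable scalar per relative offset, length 2T - 1."""
    T = q.shape[0]
    A = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # standard attention weights
    i, j = np.arange(T)[:, None], np.arange(T)[None, :]
    C = c[i - j + T - 1]                              # Toeplitz: C[i, j] = c[i - j]
    return (A * C) @ v                                # element-wise modulation

T, d = 8, 16
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, T, d))
c = rng.normal(size=2 * T - 1)
print(urpe_attention(q, k, v, c).shape)   # (8, 16)
```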
arXiv Detail & Related papers (2022-05-26T14:51:30Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
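The underlying trick is classical: a matrix whose (i, j) entry depends only on
i - j is Toeplitz, and a Toeplitz matrix-vector product can be computed in
O(n log n) by embedding the matrix in a circulant one and applying the FFT.
The self-contained sketch below illustrates just that trick; it is not the
paper's kernelized-attention code, and the sizes are arbitrary.

```python
# Fast Toeplitz matrix-vector product via circulant embedding and FFT.
import numpy as np

def toeplitz_matvec_fft(t, x):
    """Compute T @ x where T[i, j] = t_(i-j), with t indexed from -(n-1) to n-1
    as a flat array of length 2n - 1: t[k + n - 1] = t_k."""
    n = len(x)
    t = np.asarray(t, dtype=float)
    # First column of a (2n-1) x (2n-1) circulant matrix whose top-left block is T.
    col = np.concatenate([t[n - 1:], t[:n - 1]])
    y = np.concatenate([x, np.zeros(n - 1)])
    out = np.fft.ifft(np.fft.fft(col) * np.fft.fft(y)).real   # circular convolution
    return out[:n]

n = 6
rng = np.random.default_rng(0)
t = rng.normal(size=2 * n - 1)              # one value per relative offset i - j
x = rng.normal(size=n)

# Dense reference for comparison: T[i, j] = t[(i - j) + n - 1].
i, j = np.arange(n)[:, None], np.arange(n)[None, :]
T_dense = t[i - j + n - 1]
assert np.allclose(T_dense @ x, toeplitz_matvec_fft(t, x))
```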
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - CAPE: Encoding Relative Positions with Continuous Augmented Positional
Embeddings [33.87449556591022]
We propose an augmentation-based approach (CAPE) for absolute positional embeddings.
CAPE keeps the advantages of both absolute position embeddings (simplicity and speed) and relative position embeddings (better generalization).
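A hedged sketch of the augmentation idea (not the authors' code; the specific
augmentations and their ranges below are illustrative): continuous absolute
positions are randomly shifted, jittered, and rescaled during training before
being passed to an ordinary sinusoidal embedding, and the augmentations are
disabled at test time.

```python
# CAPE-style position augmentation in front of a sinusoidal embedding.
import numpy as np

def sinusoidal(pos, d_model=16, base=10000.0):
    """Standard sinusoidal embedding evaluated at (possibly non-integer) positions."""
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)
    angles = pos[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def cape_positions(T, rng, max_global_shift=5.0, max_local_shift=0.5, max_log_scale=0.3):
    """Augment the integer positions 0..T-1; the ranges here are placeholders."""
    pos = np.arange(T, dtype=float)
    pos = pos + rng.uniform(-max_global_shift, max_global_shift)        # global shift
    pos = pos + rng.uniform(-max_local_shift, max_local_shift, size=T)  # local jitter
    pos = pos * np.exp(rng.uniform(-max_log_scale, max_log_scale))      # global scaling
    return pos

rng = np.random.default_rng(0)
emb = sinusoidal(cape_positions(T=8, rng=rng))
print(emb.shape)   # (8, 16)
```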
arXiv Detail & Related papers (2021-06-06T14:54:55Z) - Relative Positional Encoding for Transformers with Linear Complexity [30.48367640796256]
Relative positional encoding (RPE) was proposed as beneficial for classical Transformers.
However, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix.
In this paper, we present a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE.
arXiv Detail & Related papers (2021-05-18T09:52:32Z)