Length Generalization of Causal Transformers without Position Encoding
- URL: http://arxiv.org/abs/2404.12224v2
- Date: Tue, 28 May 2024 01:38:59 GMT
- Title: Length Generalization of Causal Transformers without Position Encoding
- Authors: Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang
- Abstract summary: Generalizing to longer sentences is important for recent Transformer-based language models.
We study the length generalization property of Transformers without position encodings (NoPE).
We find that although NoPE can handle longer sequences than models with the commonly used explicit position encodings, it still has a limited context length.
- Score: 59.802708262402824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than models with the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of its attention distributions. We propose a parameter-efficient tuning method that searches for the best temperature hyper-parameter of each attention head, which substantially expands NoPE's context size. Experiments on long-sequence language modeling, the synthetic passkey retrieval task, and real-world long-context tasks show that NoPE achieves competitive performance with state-of-the-art length generalization algorithms. The source code is publicly accessible.
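To make the head-temperature idea concrete, below is a minimal sketch of causal self-attention without position encodings in which each head's attention logits are divided by a learnable temperature. This is an illustrative PyTorch-style reconstruction based only on the abstract; the function name, tensor shapes, and the per-head `temperature` parameterization are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def nope_attention_with_temperature(q, k, v, temperature):
    """Causal self-attention with no position encoding (NoPE).

    q, k, v:      (batch, heads, seq_len, head_dim) projections
    temperature:  (heads,) per-head scaling factors; in the sketch these
                  are the only parameters one would tune to stretch the
                  usable context.
    """
    d = q.size(-1)
    # Standard scaled dot-product logits; no positional bias is added.
    logits = q @ k.transpose(-2, -1) / d**0.5
    # Per-head temperature sharpens (t < 1) or flattens (t > 1) attention.
    logits = logits / temperature.view(1, -1, 1, 1)
    # Causal mask: token i may only attend to positions j <= i.
    seq_len = q.size(-2)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                 device=q.device), diagonal=1)
    logits = logits.masked_fill(mask, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```

Under this reading, tuning only the temperature vector (a handful of scalars per layer) while keeping the pretrained NoPE weights frozen is what makes the approach parameter-efficient.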
Related papers
- A Formal Framework for Understanding Length Generalization in Transformers [14.15513446489798]
We introduce a rigorous theoretical framework to analyze length generalization in causal transformers.
We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks.
arXiv Detail & Related papers (2024-10-03T01:52:01Z) - Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z) - Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding [40.289596031245374]
All Transformer-based models, including large language models (LLMs), suffer from a preset length limit.
Numerous methods have emerged to enhance the length extrapolation of Transformers.
This survey aims to enable the reader to gain a deep understanding of existing methods and provide stimuli for future research.
arXiv Detail & Related papers (2023-12-28T14:42:24Z) - HyPE: Attention with Hyperbolic Biases for Relative Positional Encoding [0.0]
In Transformer-based architectures, the attention mechanism is inherently permutation-invariant with respect to the input sequence's tokens.
We introduce Hyperbolic Positional Attention (HyPE), a novel method that leverages the properties of hyperbolic functions to encode tokens' relative positions.
arXiv Detail & Related papers (2023-10-30T15:54:32Z) - Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose FIRE, a novel functional relative position encoding with progressive interpolation, to improve Transformer generalization to longer contexts.
We theoretically prove that FIRE can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z) - LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z) - The Impact of Positional Encoding on Length Generalization in Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches.
Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
arXiv Detail & Related papers (2023-05-31T00:29:55Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)