Context-aware Biases for Length Extrapolation
- URL: http://arxiv.org/abs/2503.08067v1
- Date: Tue, 11 Mar 2025 05:54:58 GMT
- Title: Context-aware Biases for Length Extrapolation
- Authors: Ali Veisi, Amir Mansourian,
- Abstract summary: Transformers' ability to generalize to longer sequences degrades as sequence length increases.<n>Most Relative Positional Edu (RPE) methods address this problem by adding constant linear biases or learning general biases.<n>We propose Context-aware Biases for Length Extrapolation (Cable), that learns token-specific biases for each head in decoder-based transformers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers' ability to generalize to longer sequences than they have been trained on, known as length extrapolation, degrades as sequence length increases. Most of Relative Positional Encoding (RPE) methods address this problem by either adding constant linear biases or learning general biases, lacking the ability to specialize for different sequences. In this work, inspired by ALiBi, we propose Context-aware Biases for Length Extrapolation (Cable), that learns token-specific biases for each head in decoder-based transformers. Cable learns adaptive, context-aware biases, overcoming the limitations of fixed patterns by adding dynamic biases specific to each token in the sequence. Results show that when tested on a sequence length of 1024, a GPT-3 Medium (334M parameters) with our positional encoding, trained on a sequence length of 512, achieves better perplexity (-0.65) than a similar network with sinusoidal positional encoding trained on a sequence length of 1024. This is achieved with 48% lower memory usage, and only 3.5% higher training time. Furthermore, our method notably improves the extrapolation ability of existing RPE methods on the Edu-FineWeb10B and WikiText-103 datasets. Code is available at: https://github.com/axiomlab/Cable
Related papers
- Length Generalization of Causal Transformers without Position Encoding [59.802708262402824]
Generalizing to longer sentences is important for recent Transformer-based language models.
We study the length generalization property of Transformers without position encodings.
We find that although NoPE can extend to sequences longer than the commonly used explicit position encodings, it still has a limited context length.
arXiv Detail & Related papers (2024-04-18T14:38:32Z) - Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z) - Ultra-Long Sequence Distributed Transformer [10.263668150008316]
Transformer models trained on long sequences often achieve higher accuracy than short sequences.
Existing methods for long sequence training offer limited speedup and memory reduction.
This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer.
arXiv Detail & Related papers (2023-11-04T11:38:53Z) - Functional Interpolation for Relative Positions Improves Long Context
Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z) - LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z) - Randomized Positional Encodings Boost Length Generalization of
Transformers [14.814408238614165]
Transformers have impressive generalization capabilities on tasks with a fixed context length.
They fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string.
We introduce a novel family of positional encodings that can overcome this problem.
arXiv Detail & Related papers (2023-05-26T11:47:52Z) - Sequence Length is a Domain: Length-based Overfitting in Transformer
Models [0.0]
In machine translation, the neural-based systems perform worse on very long sequences when compared to the preceding phrase-based translation approaches.
We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
arXiv Detail & Related papers (2021-09-15T13:25:19Z) - Train Short, Test Long: Attention with Linear Biases Enables Input
Length Extrapolation [62.51758040848735]
We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation.
ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance.
We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048.
arXiv Detail & Related papers (2021-08-27T17:35:06Z) - Nystr\"omformer: A Nystr\"om-Based Algorithm for Approximating
Self-Attention [60.043273122786005]
We propose Nystr"omformer -- a model that exhibits favorable scalability as a function of sequence length.
The scalability of Nystr"omformer enables application to longer sequences with thousands of tokens.
We perform evaluations on multiple downstream tasks on the GLUE benchmark and reviews with standard sequence length, and find that our Nystr"omformer performs comparably, or in a few cases, even slightly better, than standard Transformer.
arXiv Detail & Related papers (2021-02-07T20:06:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.