Related papers: Transformers Can Achieve Length Generalization But Not Robustly

Transformers Can Achieve Length Generalization But Not Robustly

URL: http://arxiv.org/abs/2402.09371v1
Date: Wed, 14 Feb 2024 18:18:29 GMT
Title: Transformers Can Achieve Length Generalization But Not Robustly
Authors: Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou
Abstract summary: We show that the success of length generalization is intricately linked to the data format and the type of position encoding. We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
Score: 76.06308648699357
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's ability of length generalization using the task of addition of two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile, significantly influenced by factors like random weight initialization and training data order, leading to large variances across different random seeds.

Related papers

On Vanishing Variance in Transformer Length Generalization [23.706900145711913]
We show that even for today's frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. Our analyses attribute this improvement to a reduction-though not a complete elimination-of the distribution shift caused by vanishing variance.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
The Role of Sparsity for Length Generalization in Transformers [58.65997625433689]
We propose a new theoretical framework to study length generalization for the next-token prediction task. We show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach.
arXiv Detail & Related papers (2025-02-24T03:01:03Z)
Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count [19.148785141454642]
Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers.
arXiv Detail & Related papers (2024-10-21T08:49:51Z)
A Formal Framework for Understanding Length Generalization in Transformers [14.15513446489798]
We introduce a rigorous theoretical framework to analyze length generalization in causal transformers. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks.
arXiv Detail & Related papers (2024-10-03T01:52:01Z)
Looped Transformers for Length Generalization [41.99378201613648]
We show that looped Transformers with an adaptive number of steps significantly improve length generalization. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
arXiv Detail & Related papers (2024-09-24T01:21:17Z)
Length Generalization of Causal Transformers without Position Encoding [59.802708262402824]
Generalizing to longer sentences is important for recent Transformer-based language models. We study the length generalization property of Transformers without position encodings. We find that although NoPE can extend to sequences longer than the commonly used explicit position encodings, it still has a limited context length.
arXiv Detail & Related papers (2024-04-18T14:38:32Z)
Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
The Impact of Positional Encoding on Length Generalization in Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
arXiv Detail & Related papers (2023-05-31T00:29:55Z)
Randomized Positional Encodings Boost Length Generalization of Transformers [14.814408238614165]
Transformers have impressive generalization capabilities on tasks with a fixed context length. They fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string. We introduce a novel family of positional encodings that can overcome this problem.
arXiv Detail & Related papers (2023-05-26T11:47:52Z)
A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We introduce a relative position embedding to explicitly maximize attention resolution. We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
Sequence Length is a Domain: Length-based Overfitting in Transformer Models [0.0]
In machine translation, the neural-based systems perform worse on very long sequences when compared to the preceding phrase-based translation approaches. We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
arXiv Detail & Related papers (2021-09-15T13:25:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.