Related papers: A Formal Framework for Understanding Length Generalization in Transformers

A Formal Framework for Understanding Length Generalization in Transformers

URL: http://arxiv.org/abs/2410.02140v1
Date: Thu, 3 Oct 2024 01:52:01 GMT
Title: A Formal Framework for Understanding Length Generalization in Transformers
Authors: Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, Michael Hahn,
Abstract summary: We introduce a rigorous theoretical framework to analyze length generalization in causal transformers. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks.
Score: 14.15513446489798
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks. Our theory not only explains a broad set of empirical observations but also opens the way to provably predicting length generalization capabilities in transformers.

Related papers

The Role of Sparsity for Length Generalization in Transformers [58.65997625433689]
We propose a new theoretical framework to study length generalization for the next-token prediction task. We show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach.
arXiv Detail & Related papers (2025-02-24T03:01:03Z)
Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights. This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task. We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count [19.148785141454642]
Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers.
arXiv Detail & Related papers (2024-10-21T08:49:51Z)
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527837]
Chain-of-shift (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps. We show that despite the theoretical success of CoT, it fails to provide an accurate generalization when CoT does.
arXiv Detail & Related papers (2024-10-03T03:12:51Z)
Length Generalization of Causal Transformers without Position Encoding [59.802708262402824]
Generalizing to longer sentences is important for recent Transformer-based language models. We study the length generalization property of Transformers without position encodings. We find that although NoPE can extend to sequences longer than the commonly used explicit position encodings, it still has a limited context length.
arXiv Detail & Related papers (2024-04-18T14:38:32Z)
Prompting a Pretrained Transformer Can Be a Universal Approximator [105.59562522323274]
We show that much smaller pretrained models than previously thought can be universal approximators when prefixed. We also offer Jackson-type bounds on the length of the prefix needed to approximate a function to a desired precision.
arXiv Detail & Related papers (2024-02-22T18:12:48Z)
Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding. We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z)
On Provable Length and Compositional Generalization [7.883808173871223]
We provide first provable guarantees on length and compositional generalization for common sequence-to-sequence models. We show that emphsimple limited capacity versions of these different architectures achieve both length and compositional generalization.
arXiv Detail & Related papers (2024-02-07T14:16:28Z)
What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer. Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z)
From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers [7.011373967209572]
We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks.
arXiv Detail & Related papers (2023-10-18T14:10:47Z)
The Impact of Positional Encoding on Length Generalization in Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
arXiv Detail & Related papers (2023-05-31T00:29:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.