Related papers: Extrapolation by Association: Length Generalization Transfer in Transformers

URL: http://arxiv.org/abs/2506.09251v2
Date: Mon, 04 Aug 2025 16:57:32 GMT
Title: Extrapolation by Association: Length Generalization Transfer in Transformers
Authors: Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos
Abstract summary: We show that length generalization can be transferred across related tasks. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. We provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks.
Score: 29.659527141850436
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length generalization--the ability to extrapolate from shorter to longer inputs--through the lens of task association. We find that length generalization can be transferred across related tasks. That is, training a model with a longer and related auxiliary task can lead it to generalize to unseen and longer inputs from some other target task. We demonstrate this length generalization transfer across diverse algorithmic tasks, including arithmetic operations, string transformations, and maze navigation. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. Moreover, we observe similar transfer effects in pretrained language models, suggesting that pretraining equips models with reusable computational scaffolding that facilitates extrapolation in downstream settings. Finally, we provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks. Together, our findings deepen our understanding of how transformers generalize to out-of-distribution inputs and highlight the compositional reuse of inductive structure across tasks.

Related papers

The Role of Sparsity for Length Generalization in Transformers [58.65997625433689]
We propose a new theoretical framework to study length generalization for the next-token prediction task. We show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach.
arXiv Detail & Related papers (2025-02-24T03:01:03Z)
Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights. This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task. We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to a capability of pretrained large language models, which can learn a new task given a few examples during inference. This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks [5.522116934552708]
Transformer-based models excel in various tasks but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. This paper develops a unified theoretical framework for understanding the generalization behaviors of transformers in arithmetic tasks.
arXiv Detail & Related papers (2024-07-25T11:35:22Z)
Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training. Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking. Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures. We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding. We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z)
What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer. Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z)
From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers [7.011373967209572]
We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. We demonstrate that using Attention Bias Calibration (ABC), the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks.
arXiv Detail & Related papers (2023-10-18T14:10:47Z)
Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions. We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition. These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.