Related papers: From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers

URL: http://arxiv.org/abs/2310.11984v3
Date: Fri, 10 May 2024 13:41:34 GMT
Title: From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers
Authors: Shaoxiong Duan, Yining Shi, Wei Xu
Abstract summary: We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks.
Score: 7.011373967209572
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and parity. Through experiments and attention analysis, we identify a number of crucial factors for achieving optimal length generalization. We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing. In particular, our solution solves the Parity task, a well-known and theoretically proven failure mode for Transformers. We then introduce Attention Bias Calibration (ABC), a calibration stage that enables the model to automatically learn the proper attention biases, which we show to be connected to mechanisms in relative position encoding. We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks. In addition, we show that ABC bears remarkable similarities to RPE and LoRA, which may indicate the potential for applications to more complex tasks.
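The abstract's idea of attention biasing can be illustrated with a minimal NumPy sketch. This is not the authors' ABC procedure; it only shows the general mechanism of adding a bias matrix to the attention logits, where the bias depends solely on the relative offset between query and key positions (the link to relative position encoding that the abstract mentions). The function names `biased_attention` and `relative_bias`, the offset table, and the large negative default value are all assumptions made for illustration.

```python
import numpy as np

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias  # bias has shape (n_queries, n_keys)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def relative_bias(n, offset_values, default=-1e9):
    """Build an (n, n) bias whose entry (i, j) depends only on the offset j - i.

    Offsets missing from `offset_values` get a large negative value
    (an illustrative choice), which effectively masks them out.
    """
    bias = np.full((n, n), default, dtype=float)
    for i in range(n):
        for j in range(n):
            if (j - i) in offset_values:
                bias[i, j] = offset_values[j - i]
    return bias

# Hypothetical example: each position may attend only to itself and its
# immediate predecessor, regardless of sequence length.
rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
bias = relative_bias(n, {0: 0.0, -1: 0.0})
out, weights = biased_attention(q, k, v, bias)
```

Because the bias is defined per offset rather than per absolute position, the same offset table can be expanded to any sequence length, which is one plausible reason such biases help with length generalization.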

Related papers

Extrapolation by Association: Length Generalization Transfer in Transformers [29.659527141850436]
We show that length generalization can be transferred across related tasks. Our results show that transformer models can inherit generalization capabilities from similar tasks when trained jointly. We provide initial mechanistic evidence that length generalization transfer correlates with the re-use of the same attention heads between the tasks.
arXiv Detail & Related papers (2025-06-10T21:22:51Z)
The Role of Sparsity for Length Generalization in Transformers [58.65997625433689]
We propose a new theoretical framework to study length generalization for the next-token prediction task. We show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach.
arXiv Detail & Related papers (2025-02-24T03:01:03Z)
A Unified Perspective on the Dynamics of Deep Transformers [24.094975798576783]
We study the evolution of data anisotropy through a deep Transformer. We highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
arXiv Detail & Related papers (2025-01-30T13:04:54Z)
Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights. This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task. We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z)
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision. The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
A Formal Framework for Understanding Length Generalization in Transformers [14.15513446489798]
We introduce a rigorous theoretical framework to analyze length generalization in causal transformers. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks.
arXiv Detail & Related papers (2024-10-03T01:52:01Z)
Positional Attention: Expressivity and Learnability of Algorithmic Computation [6.181408276896225]
This work aims to better understand the role of attention in Transformers for algorithmic execution. We prove that Transformers with positional attention (positional Transformers) maintain the same expressivity as parallel computational models. Our results show that positional Transformers introduce a learning trade-off: while they exhibit better theoretical dependence on parameter norms, certain tasks may require more layers.
arXiv Detail & Related papers (2024-10-02T15:55:08Z)
Looped Transformers for Length Generalization [41.99378201613648]
We show that looped Transformers with an adaptive number of steps significantly improve length generalization. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
arXiv Detail & Related papers (2024-09-24T01:21:17Z)
Investigating Recurrent Transformers with Dynamic Halt [64.862738244735]
We study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism. We propose and investigate novel ways to extend and combine the methods.
arXiv Detail & Related papers (2024-02-01T19:47:31Z)
On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting. Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer. Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z)
Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning [23.677503557659705]
In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of examples and performs inference on-the-fly. We treat the transformer model as a learning algorithm that can be specialized via training to implement, at inference time, another target algorithm. We show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
arXiv Detail & Related papers (2023-01-17T18:31:12Z)
Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers [42.93754828584075]
We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR). Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
arXiv Detail & Related papers (2020-06-05T17:09:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.