What Algorithms can Transformers Learn? A Study in Length Generalization
- URL: http://arxiv.org/abs/2310.16028v1
- Date: Tue, 24 Oct 2023 17:43:29 GMT
- Title: What Algorithms can Transformers Learn? A Study in Length Generalization
- Authors: Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi,
Josh Susskind, Samy Bengio, Preetum Nakkiran
- Abstract summary: We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer.
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
- Score: 23.970598914609916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models exhibit surprising emergent generalization properties,
yet also struggle on many simple reasoning tasks such as arithmetic and parity.
This raises the question of if and when Transformer models can learn the true
algorithm for solving a task. We study the scope of Transformers' abilities in
the specific setting of length generalization on algorithmic tasks. Here, we
propose a unifying framework to understand when and how Transformers can
exhibit strong length generalization on a given task. Specifically, we leverage
RASP (Weiss et al., 2021) -- a programming language designed for the
computational model of a Transformer -- and introduce the RASP-Generalization
Conjecture: Transformers tend to length generalize on a task if the task can be
solved by a short RASP program which works for all input lengths. This simple
conjecture remarkably captures most known instances of length generalization on
algorithmic tasks. Moreover, we leverage our insights to drastically improve
generalization performance on traditionally hard tasks (such as parity and
addition). On the theoretical side, we give a simple example where the
"min-degree-interpolator" model of learning from Abbe et al. (2023) does not
correctly predict Transformers' out-of-distribution behavior, but our
conjecture does. Overall, our work provides a novel perspective on the
mechanisms of compositional generalization and the algorithmic capabilities of
Transformers.
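To make the conjecture concrete, below is a minimal sketch of what a short RASP-style program that "works for all input lengths" can look like. It is not the authors' code: RASP's select and selector_width primitives are emulated in NumPy, and the task (a histogram over the input tokens) is chosen only because it has a well-known short RASP solution. Nothing in the program depends on the sequence length.
```python
# Minimal sketch (not the authors' code): a NumPy emulation of two RASP
# primitives, used to write a histogram program that is independent of
# the input length.
import numpy as np

def select(keys, queries, predicate):
    # RASP-style select: a boolean attention pattern whose entry [q, k]
    # is True when predicate(keys[k], queries[q]) holds.
    return np.array([[predicate(k, q) for k in keys] for q in queries])

def selector_width(pattern):
    # RASP-style selector_width: how many key positions each query selects.
    return pattern.sum(axis=1)

def histogram(tokens):
    # For every token, count how many tokens in the input are equal to it.
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)

print(histogram(list("hello")))       # [1 1 2 2 1]
print(histogram(list("hello" * 20)))  # same two-line program, 20x longer input
```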
Related papers
- Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count [19.148785141454642]
Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training.
In this work, we achieve approximately 2-3x length generalization on both operand length and operand count, the first such result for arithmetic Transformers.
arXiv Detail & Related papers (2024-10-21T08:49:51Z)
- Looped Transformers for Length Generalization [41.99378201613648]
We show that looped Transformers with an adaptive number of steps significantly improve length generalization.
We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
arXiv Detail & Related papers (2024-09-24T01:21:17Z)
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to the capability of pretrained large language models to learn a new task from a few examples provided at inference time.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- Universal Length Generalization with Turing Programs [24.077722969687898]
We propose Turing Programs, a strategy that decomposes an algorithmic task into steps mimicking a Turing Machine.
By using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks.
We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task.
arXiv Detail & Related papers (2024-07-03T17:53:44Z)
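As a hedged illustration of the Turing Programs idea above, the sketch below serializes a computation as a sequence of Turing-machine configurations, so that producing each line only requires applying one transition to the previous line. The toy machine ("flip every bit"), the configuration format, and the function name run_turing_program are invented for this example and are not the paper's scratchpad format.
```python
# Illustrative sketch only: decompose a computation into explicit
# Turing-machine steps, one textual configuration per step.

def run_turing_program(tape, transitions, state="q0", blank="_", max_steps=50):
    """Yield one line per step: current state, head position, and tape."""
    tape, head = list(tape), 0
    for _ in range(max_steps):
        yield f"{state} | head={head} | tape={''.join(tape)}"
        symbol = tape[head] if head < len(tape) else blank
        if (state, symbol) not in transitions:
            return  # halt: no applicable transition
        state, write, move = transitions[(state, symbol)]
        if head == len(tape):
            tape.append(blank)
        tape[head] = write
        # Left moves past cell 0 are not handled; this is only a sketch.
        head += 1 if move == "R" else -1

# A toy machine that flips every bit and halts at the first blank cell.
flip = {("q0", "0"): ("q0", "1", "R"),
        ("q0", "1"): ("q0", "0", "R")}
print("\n".join(run_turing_program("1011", flip)))
```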
- Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z)
- From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers [7.011373967209572]
We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing.
We demonstrate that, using attention bias calibration (ABC), the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks.
arXiv Detail & Related papers (2023-10-18T14:10:47Z)
- Counting and Algorithmic Generalization with Transformers [0.0]
We show that standard Transformers are based on architectural decisions that hinder out-of-distribution performance.
We demonstrate that a modified transformer can exhibit good algorithmic generalization performance on counting.
arXiv Detail & Related papers (2023-10-12T18:39:24Z)
- ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis [54.18659323181771]
We characterize several different forms of compositional generalization that are desirable in program synthesis.
We propose ExeDec, a novel decomposition-based strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step.
arXiv Detail & Related papers (2023-07-26T01:07:52Z)
- Learning Transformer Programs [78.9509560355733]
We introduce a procedure for training Transformers that are mechanistically interpretable by design.
Instead of compiling human-written programs into Transformers, we design a modified Transformer that can be trained using gradient-based optimization.
The Transformer Programs can automatically find reasonable solutions, performing on par with standard Transformers of comparable size.
arXiv Detail & Related papers (2023-06-01T20:27:01Z)
- Compositional Generalization and Decomposition in Neural Program Synthesis [59.356261137313275]
In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize.
We first characterize several different axes along which program synthesis methods should be expected to generalize.
We introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets.
arXiv Detail & Related papers (2022-04-07T22:16:05Z)
- Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
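For flavor, here is a hedged sketch (not the RASP code from that paper) of how a Dyck-1 check can be phrased with the same kind of length-independent causal selection: compute the running balance of parentheses at every prefix and require that it never goes negative and ends at zero. The NumPy emulation and function names are written for this summary only.
```python
# Hedged sketch: Dyck-1 (balanced parentheses) via a causal attention
# pattern, written as a NumPy emulation rather than actual RASP code.
import numpy as np

def causal_select(n):
    # Attend to all positions j <= i, i.e. a lower-triangular pattern.
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_balance(tokens):
    # Running count of '(' minus ')' at every prefix position.
    signs = np.array([1.0 if t == "(" else -1.0 if t == ")" else 0.0
                      for t in tokens])
    return causal_select(len(tokens)) @ signs

def is_dyck1(tokens):
    balance = prefix_balance(tokens)
    return bool(balance.min() >= 0 and balance[-1] == 0)

print(is_dyck1(list("(()())")))  # True
print(is_dyck1(list("())(")))    # False
```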