What Algorithms can Transformers Learn? A Study in Length Generalization
- URL: http://arxiv.org/abs/2310.16028v1
- Date: Tue, 24 Oct 2023 17:43:29 GMT
- Title: What Algorithms can Transformers Learn? A Study in Length Generalization
- Authors: Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi,
Josh Susskind, Samy Bengio, Preetum Nakkiran
- Abstract summary: We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer.
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
- Score: 23.970598914609916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models exhibit surprising emergent generalization properties,
yet also struggle on many simple reasoning tasks such as arithmetic and parity.
This raises the question of if and when Transformer models can learn the true
algorithm for solving a task. We study the scope of Transformers' abilities in
the specific setting of length generalization on algorithmic tasks. Here, we
propose a unifying framework to understand when and how Transformers can
exhibit strong length generalization on a given task. Specifically, we leverage
RASP (Weiss et al., 2021) -- a programming language designed for the
computational model of a Transformer -- and introduce the RASP-Generalization
Conjecture: Transformers tend to length generalize on a task if the task can be
solved by a short RASP program which works for all input lengths. This simple
conjecture remarkably captures most known instances of length generalization on
algorithmic tasks. Moreover, we leverage our insights to drastically improve
generalization performance on traditionally hard tasks (such as parity and
addition). On the theoretical side, we give a simple example where the
"min-degree-interpolator" model of learning from Abbe et al. (2023) does not
correctly predict Transformers' out-of-distribution behavior, but our
conjecture does. Overall, our work provides a novel perspective on the
mechanisms of compositional generalization and the algorithmic capabilities of
Transformers.
Related papers
- Universal Length Generalization with Turing Programs [24.077722969687898]
We propose Turing Programs, a strategy that decomposes an algorithmic task into steps mimicking a Turing Machine.
By using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks.
We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task.
arXiv Detail & Related papers (2024-07-03T17:53:44Z) - Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z) - From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers [7.011373967209572]
We show that transformer models are able to generalize to long lengths with the help of targeted attention biasing.
We demonstrate that using ABC, the transformer model can achieve unprecedented near-perfect length generalization on certain arithmetic tasks.
arXiv Detail & Related papers (2023-10-18T14:10:47Z) - Counting and Algorithmic Generalization with Transformers [0.0]
We show that standard Transformers are based on architectural decisions that hinder out-of-distribution performance.
We demonstrate that a modified transformer can exhibit a good algorithmic generalization performance on counting.
arXiv Detail & Related papers (2023-10-12T18:39:24Z) - ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis [54.18659323181771]
We characterize several different forms of compositional generalization that are desirable in program synthesis.
We propose ExeDec, a novel decomposition-based strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step.
arXiv Detail & Related papers (2023-07-26T01:07:52Z) - Learning Transformer Programs [78.9509560355733]
We introduce a procedure for training Transformers that are mechanistically interpretable by design.
Instead of compiling human-written programs into Transformers, we design a modified Transformer that can be trained using gradient-based optimization.
The Transformer Programs can automatically find reasonable solutions, performing on par with standard Transformers of comparable size.
arXiv Detail & Related papers (2023-06-01T20:27:01Z) - The Impact of Positional Encoding on Length Generalization in
Transformers [50.48278691801413]
We compare the length generalization performance of decoder-only Transformers with five different position encoding approaches.
Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks.
arXiv Detail & Related papers (2023-05-31T00:29:55Z) - Systematic Generalization and Emergent Structures in Transformers
Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z) - Compositional Generalization and Decomposition in Neural Program
Synthesis [59.356261137313275]
In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize.
We first characterize several different axes along which program synthesis methods would be desired to generalize.
We introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets.
arXiv Detail & Related papers (2022-04-07T22:16:05Z) - The Neural Data Router: Adaptive Control Flow in Transformers Improves
Systematic Generalization [8.424405898986118]
We propose two modifications to the Transformer architecture, copy gate and geometric attention.
Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task.
NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing.
arXiv Detail & Related papers (2021-10-14T21:24:27Z) - Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.