Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks
- URL: http://arxiv.org/abs/2406.01895v1
- Date: Tue, 4 Jun 2024 02:00:07 GMT
- Title: Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks
- Authors: Mahdi Sabbaghi, George Pappas, Hamed Hassani, Surbhi Goel
- Abstract summary: We show that Transformers fail to generalize over length on basic arithmetic tasks such as addition and multiplication.
A major reason behind this failure is the vast difference in structure between numbers and text.
We propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings.
- Score: 32.81985604969825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the success of Transformers on language understanding, code generation, and logical reasoning, they still fail to generalize over length on basic arithmetic tasks such as addition and multiplication. A major reason behind this failure is the vast difference in structure between numbers and text; for example, numbers are typically parsed from right to left, and there is a correspondence between digits at the same position across different numbers. In contrast, such symmetries are quite unnatural for text. In this work, we propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings. Empirically, our method allows a Transformer trained on numbers with at most 5 digits for addition and multiplication to generalize up to 50-digit numbers, without using additional data for longer sequences. We further demonstrate that traditional absolute positional encodings (APE) fail to generalize to longer sequences, even when trained with augmented data that captures task symmetries. To elucidate the importance of explicitly encoding structure, we prove that explicit incorporation of structure via positional encodings is necessary for out-of-distribution generalization. Finally, we pinpoint other challenges inherent to length generalization beyond capturing symmetries, in particular the complexity of the underlying task, and propose changes in the training distribution to address them.
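The abstract describes the formatting and positional-encoding changes only at a high level. As a rough illustration (not the paper's implementation; the function name and format details below are assumptions), the following sketch tokenizes numbers digit by digit, writes operands least-significant digit first with zero padding, and assigns each digit a position index given by its significance rather than by its offset in the full sequence:
```python
# Illustrative sketch only: digit-level tokens, reversed (right-to-left) operand
# formatting, and significance-based position indices. The function name and the
# exact format are assumptions for illustration, not the paper's actual code.

def format_addition_example(a: int, b: int, width: int):
    """Format 'a + b = c' with reversed, zero-padded operands and per-digit
    position indices that encode significance (0 = units, 1 = tens, ...)."""
    # Reversing the digits lets the model read least-significant digits first,
    # matching how carries propagate in grade-school addition.
    a_rev = str(a)[::-1].ljust(width, "0")
    b_rev = str(b)[::-1].ljust(width, "0")
    c_rev = str(a + b)[::-1].ljust(width + 1, "0")  # the sum may carry one extra digit

    tokens, positions = [], []
    for digits, sep in ((a_rev, "+"), (b_rev, "="), (c_rev, None)):
        for significance, d in enumerate(digits):
            tokens.append(d)
            positions.append(significance)   # same significance -> same index
        if sep is not None:
            tokens.append(sep)
            positions.append(width + 1)      # separators get a reserved index
    return tokens, positions


if __name__ == "__main__":
    toks, pos = format_addition_example(347, 59, width=5)
    print(list(zip(toks, pos)))
```
With this layout, the digit carrying a given power of ten always receives the same position index regardless of operand length, which is the kind of structural symmetry the abstract argues must be made explicit.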
Related papers
- Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks [5.522116934552708]
Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood.
We show that models with appropriate positional embeddings can correctly perform arithmetic operations, such as addition, on longer, unseen operand lengths.
We also show that models perform well for longer unseen cases in modular addition under specific moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101).
These findings deepen our understanding of the generalization mechanisms, and facilitate more data-efficient model training and objective-oriented AI alignment.
arXiv Detail & Related papers (2024-07-25T11:35:22Z)
- Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure [42.89515104473087]
We propose position coupling, a simple yet effective method that embeds the structure of the tasks into the positional encoding of a Transformer; a minimal sketch of the idea follows this entry.
We show that our models trained on 1 to 30-digit additions can generalize up to 200-digit additions.
We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task.
arXiv Detail & Related papers (2024-05-31T08:13:35Z)
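Position coupling is described in the entry above in a single sentence. The sketch below is a minimal, hypothetical rendering of the idea for addition (token layout and ID assignment are illustrative assumptions, not the authors' code): digits of the two operands and of the answer that share the same significance are given the same position ID, while operator tokens get a separate ID.
```python
# Minimal, hypothetical sketch of position coupling for "a+b=c": digits of a, b,
# and c that share the same significance receive the same position ID, making the
# alignment that addition depends on explicit. Token layout and ID assignment
# here are illustrative assumptions, not the authors' code.

def coupled_position_ids(a: str, b: str, c: str):
    """Return tokens and coupled position IDs for the sequence 'a + b = c'."""
    width = max(len(a), len(b), len(c))
    tokens, ids = [], []

    def add_number(digits: str) -> None:
        for i, d in enumerate(digits):
            tokens.append(d)
            # Position ID depends only on the digit's significance (distance from
            # the right end), so aligned digits across a, b, and c share an ID.
            ids.append(width - (len(digits) - i) + 1)

    add_number(a)
    tokens.append("+"); ids.append(0)   # operator/separator tokens share ID 0
    add_number(b)
    tokens.append("="); ids.append(0)
    add_number(c)
    return tokens, ids


if __name__ == "__main__":
    print(coupled_position_ids("347", "59", "406"))
    # '3' and the '4' of 406 (hundreds) get ID 1; '7', '9', '6' (units) get ID 3.
```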
- Transformers Can Do Arithmetic with the Right Embeddings [75.66545271398704]
We show how to improve the performance of transformers on arithmetic tasks.
We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance.
These gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
arXiv Detail & Related papers (2024-05-27T17:49:18Z)
- Length Generalization of Causal Transformers without Position Encoding [59.802708262402824]
Generalizing to longer sentences is important for recent Transformer-based language models.
We study the length generalization property of Transformers without position encodings.
We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings allow, it still has a limited context length.
arXiv Detail & Related papers (2024-04-18T14:38:32Z)
- Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z)
- Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently; one possible representation change is sketched after this entry.
arXiv Detail & Related papers (2023-11-22T00:31:01Z)
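The entry above names two directions: changing the positional encoding itself, or changing the task representation so that a standard positional encoding suffices. Purely as an illustration of the second direction (this exact format is an assumption, not necessarily the paper's proposal), the sketch below interleaves explicit index markers with the digits so that digit alignment is visible in the tokens themselves:
```python
# Illustrative only: modify the surface form of the task rather than the model.
# Explicit index markers are interleaved with the digits so that a standard
# (absolute) positional encoding can read digit alignment off the tokens. This
# exact format is an assumption, not necessarily the paper's proposal.

def with_index_hints(a: int, b: int) -> str:
    """Render 'a + b =' with an index marker before each digit, counted from the
    least-significant digit, e.g. 347 -> 'i2 3 i1 4 i0 7'."""
    def mark(n: int) -> str:
        digits = str(n)
        k = len(digits)
        return " ".join(f"i{k - 1 - j} {d}" for j, d in enumerate(digits))
    return f"{mark(a)} + {mark(b)} ="


if __name__ == "__main__":
    print(with_index_hints(347, 59))   # i2 3 i1 4 i0 7 + i1 5 i0 9 =
```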
- Randomized Positional Encodings Boost Length Generalization of Transformers [14.814408238614165]
Transformers have impressive generalization capabilities on tasks with a fixed context length.
However, they fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string.
We introduce a novel family of positional encodings that can overcome this problem; the core sampling idea is sketched after this entry.
arXiv Detail & Related papers (2023-05-26T11:47:52Z)
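The randomized-positional-encoding idea above can be illustrated in a few lines: during training, instead of assigning positions 0..L-1, sample L sorted positions from a much larger range, so that large position values are already familiar when longer test sequences arrive. The sketch below shows only that sampling step, under assumed names; it is not the authors' implementation.
```python
# Minimal illustration of randomized positional encodings: at training time,
# sample a sorted subset of positions from a range much larger than the training
# lengths, so the position values reached by longer test sequences are not
# out-of-distribution. Function and parameter names are assumptions.
import random


def randomized_positions(seq_len: int, max_pos: int = 2048) -> list[int]:
    """Sample `seq_len` distinct positions from [0, max_pos) and sort them,
    preserving relative order while covering the full position range."""
    assert seq_len <= max_pos
    return sorted(random.sample(range(max_pos), seq_len))


if __name__ == "__main__":
    random.seed(0)
    print(randomized_positions(10))   # a 10-token sequence gets 10 spread-out positions
    # The same sampling can be applied at evaluation time to longer sequences.
```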
- Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z)
- Investigating the Limitations of the Transformers with Simple Arithmetic Tasks [10.23804850480924]
We find that how a number is represented in its surface form has a strong influence on the model's accuracy.
We conclude that modern pretrained language models can easily learn arithmetic from very few examples.
arXiv Detail & Related papers (2021-02-25T17:22:53Z)