Transformers Can Do Arithmetic with the Right Embeddings
- URL: http://arxiv.org/abs/2405.17399v1
- Date: Mon, 27 May 2024 17:49:18 GMT
- Title: Transformers Can Do Arithmetic with the Right Embeddings
- Authors: Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein
- Abstract summary: We show how to improve the performance of transformers on arithmetic tasks.
We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance.
These gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
- Score: 75.66545271398704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside a large span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With positions resolved, we can study the logical extrapolation ability of transformers. Can they solve arithmetic problems that are larger and more complex than those in their training data? We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100-digit addition problems. Finally, we show that these gains in numeracy also unlock improvements on other multi-step reasoning tasks, including sorting and multiplication.
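The fix described in the abstract lends itself to a short illustration. Below is a minimal PyTorch sketch of the idea, not the authors' implementation: each digit token receives a learned embedding indexed by its offset from the start of its number, added to the ordinary token embedding. The module name `DigitPositionEmbedding`, the `is_digit` mask, and the loop-based offset computation are illustrative assumptions; details such as the indexing direction and any training-time offset randomization used in the paper are omitted.
```python
import torch
import torch.nn as nn

class DigitPositionEmbedding(nn.Module):
    """Learned embedding indexed by each digit's offset within its own number,
    added on top of the ordinary token embeddings (illustrative sketch only)."""

    def __init__(self, d_model: int, max_digits: int = 128):
        super().__init__()
        # Index 0 is reserved for non-digit tokens (operators, separators, padding).
        self.offset_embedding = nn.Embedding(max_digits + 1, d_model)

    def forward(self, token_embeddings: torch.Tensor, is_digit: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model); is_digit: (batch, seq_len) bool mask.
        # Count consecutive digit tokens, resetting at non-digit tokens, so each
        # digit gets a 1-based offset within the number it belongs to.
        batch, seq_len = is_digit.shape
        offsets = torch.zeros(batch, seq_len, dtype=torch.long, device=is_digit.device)
        running = torch.zeros(batch, dtype=torch.long, device=is_digit.device)
        for t in range(seq_len):
            running = torch.where(is_digit[:, t], running + 1, torch.zeros_like(running))
            offsets[:, t] = running
        return token_embeddings + self.offset_embedding(offsets)

# Example: the string "123+45" as six tokens with a digit mask.
emb = DigitPositionEmbedding(d_model=16)
tokens = torch.randn(1, 6, 16)  # stand-in token embeddings
mask = torch.tensor([[True, True, True, False, True, True]])
out = emb(tokens, mask)         # per-token offsets: 1 2 3 0 1 2
```
Per the abstract, it is this per-digit position signal that then lets architectural modifications such as input injection and recurrent layers improve performance further.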
Related papers
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z) - Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks [5.522116934552708]
Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood.
We show that models with appropriate positional embeddings can correctly perform longer unseen arithmetic operations such as addition.
We also show that models perform well for longer unseen cases in modular addition under specific moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101).
These findings deepen our understanding of the generalization mechanisms, and facilitate more data-efficient model training and objective-oriented AI alignment.
arXiv Detail & Related papers (2024-07-25T11:35:22Z) - Dissecting Multiplication in Transformers: Insights into LLMs [23.109124772063574]
We focus on a typical arithmetic task, integer multiplication, to explore and explain the imperfection of transformers in this domain.
We provide a comprehensive analysis of a vanilla transformer trained to perform n-digit integer multiplication.
We propose improvements to enhance transformers' performance on multiplication tasks.
arXiv Detail & Related papers (2024-07-22T04:07:26Z) - Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure [42.89515104473087]
We propose position coupling, a simple yet effective method that embeds the structure of the tasks into the positional encoding of a Transformer.
We show that our models trained on 1 to 30-digit additions can generalize up to 200-digit additions.
We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task.
arXiv Detail & Related papers (2024-05-31T08:13:35Z) - Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
arXiv Detail & Related papers (2023-11-22T00:31:01Z) - Multiplication-Free Transformer Training via Piecewise Affine Operations [44.99157696237478]
We replace multiplication with a cheap piecewise affine approximation, achieved by adding the bit representations of the floating-point numbers together as integers (a standalone sketch of this bit trick appears after this list).
We show that transformers can be trained with the resulting modified matrix multiplications on both vision and language tasks with little to no performance impact.
arXiv Detail & Related papers (2023-05-26T18:28:28Z) - Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
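The bit-level trick described in the multiplication-free training entry above can be shown in a few lines. The sketch below is a generic illustration of the piecewise affine idea for positive float32 values, not that paper's training code; the function name `approx_mul` and the test values are made up. It relies on the fact that adding the raw IEEE-754 bit patterns of two positive floats and re-biasing by the bit pattern of 1.0 (0x3F800000) approximates the bit pattern of their product.
```python
import numpy as np

def approx_mul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Approximate elementwise product of positive float32 arrays by adding
    their raw bit patterns as int32 and re-biasing by the bit pattern of 1.0."""
    ia = a.astype(np.float32).view(np.int32)
    ib = b.astype(np.float32).view(np.int32)
    one = np.int32(0x3F800000)  # bit pattern of float32 1.0
    # Subtract the bias before adding to keep the int32 arithmetic in range.
    return (ia - one + ib).view(np.float32)

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=5).astype(np.float32)
y = rng.uniform(0.5, 2.0, size=5).astype(np.float32)
print("exact :", x * y)
print("approx:", approx_mul(x, y))  # close to, but not exactly, the true products
```
That paper's contribution is showing that transformers can be trained on vision and language tasks with matrix multiplications built from this kind of approximation at little to no performance cost; the snippet only demonstrates the underlying numeric trick.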
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.