Transformers Can Do Arithmetic with the Right Embeddings
- URL: http://arxiv.org/abs/2405.17399v1
- Date: Mon, 27 May 2024 17:49:18 GMT
- Title: Transformers Can Do Arithmetic with the Right Embeddings
- Authors: Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein
- Abstract summary: We show how to improve the performance of transformers on arithmetic tasks.
We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance.
These gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
- Score: 75.66545271398704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside a large span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With positions resolved, we can study the logical extrapolation ability of transformers. Can they solve arithmetic problems that are larger and more complex than those in their training data? We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100-digit addition problems. Finally, we show that these gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
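As a rough illustration of the embedding idea, the PyTorch-style sketch below assigns each digit an index equal to its offset from the start of the number it belongs to and adds a learned embedding for that index to the token embedding. The class name, tokenization, and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' released code): a learned embedding for each
# digit, indexed by the digit's offset from the start of the number it sits in.
import torch
import torch.nn as nn

class DigitPositionEmbedding(nn.Module):  # hypothetical name
    def __init__(self, max_digits: int, d_model: int):
        super().__init__()
        # index 0 is reserved for non-digit tokens (operators, '=', padding)
        self.emb = nn.Embedding(max_digits + 1, d_model)

    @staticmethod
    def digit_offsets(text: str) -> list[int]:
        """Per character: 1-based offset from the start of its digit run, 0 for non-digits."""
        offsets, run = [], 0
        for ch in text:
            run = run + 1 if ch.isdigit() else 0
            offsets.append(run)
        return offsets

    def forward(self, text: str, token_embeddings: torch.Tensor) -> torch.Tensor:
        idx = torch.tensor(self.digit_offsets(text))      # (seq_len,)
        return token_embeddings + self.emb(idx)           # (seq_len, d_model)

# "123+456=" -> offsets [1, 2, 3, 0, 1, 2, 3, 0]
layer = DigitPositionEmbedding(max_digits=100, d_model=64)
out = layer("123+456=", torch.zeros(8, 64))
```

Because the index resets at each number boundary, the size of the embedding table depends only on the maximum number of digits per operand, not on overall sequence length.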
Related papers
- Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks [5.522116934552708]
Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood.
We show that models with appropriate positional embeddings can correctly perform longer unseen arithmetic operations such as addition.
We also show that models perform well for longer unseen cases in modular addition under specific moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101).
These findings deepen our understanding of the generalization mechanisms, and facilitate more data-efficient model training and objective-oriented AI alignment.
arXiv Detail & Related papers (2024-07-25T11:35:22Z)
- Dissecting Multiplication in Transformers: Insights into LLMs [23.109124772063574]
We focus on a typical arithmetic task, integer multiplication, to explore and explain the limitations of transformers in this domain.
We provide a comprehensive analysis of a vanilla transformer trained to perform n-digit integer multiplication.
We propose improvements to enhance transformer performance on multiplication tasks.
arXiv Detail & Related papers (2024-07-22T04:07:26Z)
- Position Coupling: Leveraging Task Structure for Improved Length Generalization of Transformers [42.89515104473087]
We propose position coupling, a simple yet effective method that embeds the structure of the tasks into the positional encoding of a Transformer.
We show that a small (1-layer) Transformer trained on 1 to 30-digit additions can generalize up to 200-digit additions.
We also demonstrate that position coupling can be applied to other algorithmic tasks such as addition with multiple summands, Nx2 multiplication, copy/reverse, and a two-dimensional task.
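As a rough sketch of the coupling idea (hypothetical helper, not the authors' exact scheme), digits of equal significance in the two operands and the answer can be given the same position ID, while operator tokens share a dummy ID:

```python
# Hypothetical sketch of position coupling for "a+b=c" (not the authors' exact scheme):
# digits with the same significance in a, b, and c receive the same position ID,
# and '+' / '=' tokens get a shared dummy ID of 0.
def coupled_position_ids(a: str, b: str, c: str) -> list[int]:
    def ids_for(num: str) -> list[int]:
        # least significant digit -> 1, next -> 2, ... (numbers written left to right)
        return [len(num) - i for i in range(len(num))]
    return ids_for(a) + [0] + ids_for(b) + [0] + ids_for(c)

# "57+68=125" -> [2, 1, 0, 2, 1, 0, 3, 2, 1]
print(coupled_position_ids("57", "68", "125"))
```

These IDs would replace the usual 0, 1, 2, ... indices fed to the positional encoding, so the column-wise alignment needed for addition looks the same at every operand length.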
arXiv Detail & Related papers (2024-05-31T08:13:35Z)
- Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
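As an illustration of the second strategy, one representation change used in this line of work (not necessarily the one proposed in this paper) is to write numbers least-significant digit first, so each output digit depends only on digits and carries that appear earlier in the sequence:

```python
# Illustrative only: reversed-digit formatting of an addition example, so standard
# left-to-right positional encodings line up with carry propagation.
def to_reversed_format(a: int, b: int) -> str:
    rev = lambda n: str(n)[::-1]
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"

print(to_reversed_format(734, 589))  # 734 + 589 = 1323 -> "437+985=3231"
```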
arXiv Detail & Related papers (2023-11-22T00:31:01Z)
- Learning Transformer Programs [78.9509560355733]
We introduce a procedure for training Transformers that are mechanistically interpretable by design.
Instead of compiling human-written programs into Transformers, we design a modified Transformer that can be trained using gradient-based optimization.
The Transformer Programs can automatically find reasonable solutions, performing on par with standard Transformers of comparable size.
arXiv Detail & Related papers (2023-06-01T20:27:01Z)
- Multiplication-Free Transformer Training via Piecewise Affine Operations [44.99157696237478]
We replace multiplications with a cheap piecewise affine approximation, obtained by adding the bit representations of the floating-point numbers together as integers.
We show that transformers can be trained with the resulting modified matrix multiplications on both vision and language tasks with little to no performance impact.
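A minimal NumPy sketch of the underlying bit-addition trick for positive float32 values is given below; it works because a float's bit pattern is a piecewise linear approximation of its logarithm. Sign handling and the integration into matrix multiplications during training, as described in the paper, are more involved.

```python
# Minimal sketch of the bit-addition trick (positive float32 only; the paper's
# full scheme also handles signs and is applied inside matrix multiplications).
import numpy as np

def approx_mul(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    BIAS = np.int32(0x3F800000)                    # bit pattern of 1.0f
    xi = x.astype(np.float32).view(np.int32)
    yi = y.astype(np.float32).view(np.int32)
    # Subtract the bias first so the intermediate sum stays inside int32 range.
    return (xi - BIAS + yi).view(np.float32)

x = np.array([1.5, 3.7, 9.0])
y = np.array([2.0, 0.25, 9.0])
print(x * y)             # exact:  [3.0, 0.925, 81.0]
print(approx_mul(x, y))  # approx: [3.0, 0.925, 80.0]
```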
arXiv Detail & Related papers (2023-05-26T18:28:28Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.