Investigating the Limitations of the Transformers with Simple Arithmetic Tasks
- URL: http://arxiv.org/abs/2102.13019v1
- Date: Thu, 25 Feb 2021 17:22:53 GMT
- Title: Investigating the Limitations of the Transformers with Simple Arithmetic Tasks
- Authors: Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin
- Abstract summary: We find that how a number is represented in its surface form has a strong influence on the model's accuracy.
We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as the proper surface representation is used.
- Score: 10.23804850480924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to perform arithmetic tasks is a remarkable trait of human
intelligence and might form a critical component of more complex reasoning
tasks. In this work, we investigate if the surface form of a number has any
influence on how sequence-to-sequence language models learn simple arithmetic
tasks such as addition and subtraction across a wide range of values. We find
that how a number is represented in its surface form has a strong influence on
the model's accuracy. In particular, the model fails to learn addition of
five-digit numbers when using subwords (e.g., "32"), and it struggles to learn
with character-level representations (e.g., "3 2"). By introducing position
tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract
numbers up to 60 digits. We conclude that modern pretrained language models can
easily learn arithmetic from very few examples, as long as we use the proper
surface representation. This result bolsters evidence that subword tokenizers
and positional encodings are components in current transformer designs that
might need improvement. Moreover, we show that regardless of the number of
parameters and training examples, models cannot learn addition rules that are
independent of the length of the numbers seen during training. Code to
reproduce our experiments is available at
https://github.com/castorini/transformers-arithmetic
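The surface forms named in the abstract can be illustrated with a minimal sketch. This is not the authors' code (see the repository above for that); the function names and the `mark_units` flag are hypothetical and introduced only for illustration. By default the sketch follows the abstract's example, "3 10e1 2", which puts no position token on the units digit.

```python
def to_character(n: int) -> str:
    """Character-level form: digits separated by spaces, e.g. 32 -> "3 2"."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return " ".join(str(n))


def to_position_tokens(n: int, mark_units: bool = False) -> str:
    """Position-token form: each digit is followed by a power-of-ten marker.

    With mark_units=False (an assumption matching the abstract's example),
    the units digit carries no marker: 32 -> "3 10e1 2".
    """
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = str(n)
    parts = []
    for i, digit in enumerate(digits):
        exponent = len(digits) - 1 - i
        parts.append(digit)
        if exponent > 0 or mark_units:
            parts.append(f"10e{exponent}")
    return " ".join(parts)


def from_position_tokens(text: str) -> int:
    """Invert to_position_tokens; a digit without a marker is the units digit."""
    tokens = text.split()
    total = 0
    i = 0
    while i < len(tokens):
        digit = int(tokens[i])
        if i + 1 < len(tokens) and tokens[i + 1].startswith("10e"):
            exponent = int(tokens[i + 1][3:])
            i += 2
        else:
            exponent = 0
            i += 1
        total += digit * 10 ** exponent
    return total


print(to_character(32))
print(to_position_tokens(32))
```

Because each digit is paired with an explicit marker of its magnitude, the place value of a digit can be read from the tokens themselves and does not have to be inferred from its offset in the sequence.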
Related papers
- Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks [5.522116934552708]
Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood.
We show that models with appropriate positional embeddings can correctly perform longer unseen arithmetic operations such as addition.
We also show that models perform well for longer unseen cases in modular addition under specific moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101).
These findings deepen our understanding of the generalization mechanisms, and facilitate more data-efficient model training and objective-oriented AI alignment.
arXiv Detail & Related papers (2024-07-25T11:35:22Z)
- How to Leverage Digit Embeddings to Represent Numbers? [13.880400817682059]
Generalisations, such as solving 100+200 instead of 1+2, can substantially affect model performance.
Character-level embeddings of numbers have emerged as a promising approach to improve number representation.
We use mathematical priors to compute aggregated digit embeddings and explicitly incorporate these aggregates into transformer models.
arXiv Detail & Related papers (2024-07-01T01:31:41Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language.
We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner.
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
arXiv Detail & Related papers (2023-11-22T00:31:01Z)
- Length Generalization in Arithmetic Transformers [41.62455986786115]
We show how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training.
We propose train set priming: adding a few ($10$ to $50$) long sequences to the training set.
We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35 \times 3$ examples.
arXiv Detail & Related papers (2023-06-27T11:53:25Z)
- Is Integer Arithmetic Enough for Deep Learning Training? [2.9136421025415205]
Replacing floating-point arithmetic with low-bit integer arithmetic is a promising approach to reduce the energy use, memory footprint, and latency of deep learning models.
We propose a fully functional integer training pipeline including forward pass, back-propagation, and gradient descent.
Our experimental results show that our proposed method is effective in a wide variety of tasks such as classification (including vision transformers), object detection, and semantic segmentation.
arXiv Detail & Related papers (2022-07-18T22:36:57Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- NumGPT: Improving Numeracy Ability of Generative Pre-trained Models [59.931394234642816]
We propose NumGPT, a generative pre-trained model that explicitly models the numerical properties of numbers in texts.
Specifically, it leverages a prototype-based numeral embedding to encode the mantissa of the number and an individual embedding to encode the exponent of the number.
A numeral-aware loss function is designed to integrate numerals into the pre-training objective of NumGPT.
arXiv Detail & Related papers (2021-09-07T15:06:12Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
- When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning? [53.523017945443115]
We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples.
Our results do not depend on the training algorithm or the class of models used for learning.
arXiv Detail & Related papers (2020-12-11T15:25:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.