Investigating the Limitations of Transformers with Simple Arithmetic Tasks
- URL: http://arxiv.org/abs/2102.13019v1
- Date: Thu, 25 Feb 2021 17:22:53 GMT
- Title: Investigating the Limitations of Transformers with Simple Arithmetic Tasks
- Authors: Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin
- Abstract summary: We find that how a number is represented in its surface form has a strong influence on the model's accuracy.
We conclude that modern pretrained language models can easily learn arithmetic from very few examples.
- Score: 10.23804850480924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to perform arithmetic tasks is a remarkable trait of human
intelligence and might form a critical component of more complex reasoning
tasks. In this work, we investigate if the surface form of a number has any
influence on how sequence-to-sequence language models learn simple arithmetic
tasks such as addition and subtraction across a wide range of values. We find
that how a number is represented in its surface form has a strong influence on
the model's accuracy. In particular, the model fails to learn addition of
five-digit numbers when using subwords (e.g., "32"), and it struggles to learn
with character-level representations (e.g., "3 2"). By introducing position
tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract
numbers up to 60 digits. We conclude that modern pretrained language models can
easily learn arithmetic from very few examples, as long as we use the proper
surface representation. This result bolsters evidence that subword tokenizers
and positional encodings are components in current transformer designs that
might need improvement. Moreover, we show that regardless of the number of
parameters and training examples, models cannot learn addition rules that are
independent of the length of the numbers seen during training. Code to
reproduce our experiments is available at
https://github.com/castorini/transformers-arithmetic
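For illustration, the three surface forms contrasted in the abstract can be sketched in a few lines of Python. This is not the authors' preprocessing code (that lives in the repository linked above); the position-token form below follows the abstract's "3 10e1 2" example, and how the units digit is handled is an assumption.

```python
# A minimal sketch (not the authors' preprocessing code) of the three surface
# forms contrasted in the abstract: a plain decimal string that a subword
# tokenizer may split arbitrarily ("32"), a character-level form ("3 2"), and
# a position-token form ("3 10e1 2"). Whether the units digit also receives a
# "10e0" marker is not specified in the abstract; here it is omitted.

def plain_form(n: int) -> str:
    """Digits as-is; subword tokenization decides how they are split."""
    return str(n)

def character_form(n: int) -> str:
    """One token per digit, separated by spaces."""
    return " ".join(str(n))

def position_token_form(n: int) -> str:
    """Follow each digit (except the units digit) with its order of magnitude."""
    digits = str(n)
    parts = []
    for i, d in enumerate(digits):
        parts.append(d)
        exponent = len(digits) - 1 - i  # place value of this digit
        if exponent > 0:
            parts.append(f"10e{exponent}")
    return " ".join(parts)

if __name__ == "__main__":
    for n in (32, 60417):
        print(plain_form(n), "|", character_form(n), "|", position_token_form(n))
    # 32    | 3 2       | 3 10e1 2
    # 60417 | 6 0 4 1 7 | 6 10e4 0 10e3 4 10e2 1 10e1 7
```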
Related papers
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia [55.23627698804683]
We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^2$ or $10^3$ system across training data scale.
We identify that base $100$ and base $1000$ systems struggle with token-level discernment and token-level operations; a minimal sketch contrasting these numeral systems appears after this list.
arXiv Detail & Related papers (2024-09-25T22:08:31Z)
- Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks [5.522116934552708]
Large language models (LLMs) have demonstrated impressive versatility across numerous tasks, yet their generalization capabilities remain poorly understood.
We show that models with appropriate positional embeddings can correctly perform longer unseen arithmetic operations such as addition.
We also show that models perform well for longer unseen cases in modular addition under specific moduli (e.g., modulo 100) but struggle under very close moduli (e.g., modulo 101).
These findings deepen our understanding of the generalization mechanisms, and facilitate more data-efficient model training and objective-oriented AI alignment.
arXiv Detail & Related papers (2024-07-25T11:35:22Z)
- How to Leverage Digit Embeddings to Represent Numbers? [13.880400817682059]
Generalisations, such as solving 100+200 instead of 1+2, can substantially affect model performance.
Character-level embeddings of numbers have emerged as a promising approach to improve number representation.
We use mathematical priors to compute aggregated digit embeddings and explicitly incorporate these aggregates into transformer models.
arXiv Detail & Related papers (2024-07-01T01:31:41Z)
- Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
arXiv Detail & Related papers (2023-11-22T00:31:01Z)
- Length Generalization in Arithmetic Transformers [41.62455986786115]
We show how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training.
We propose train set priming: adding a few ($10$ to $50$) long sequences to the training set.
We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35 \times 3$ examples.
arXiv Detail & Related papers (2023-06-27T11:53:25Z)
- Is Integer Arithmetic Enough for Deep Learning Training? [2.9136421025415205]
Replacing floating-point arithmetic with low-bit integer arithmetic is a promising approach to reduce the energy use, memory footprint, and latency of deep learning models.
We propose a fully functional integer training pipeline including forward pass, back-propagation, and gradient descent.
Our experimental results show that our proposed method is effective in a wide variety of tasks such as classification (including vision transformers), object detection, and semantic segmentation.
arXiv Detail & Related papers (2022-07-18T22:36:57Z)
- NumGPT: Improving Numeracy Ability of Generative Pre-trained Models [59.931394234642816]
We propose NumGPT, a generative pre-trained model that explicitly models the numerical properties of numbers in texts.
Specifically, it leverages a prototype-based numeral embedding to encode the mantissa of the number and an individual embedding to encode the exponent of the number.
A numeral-aware loss function is designed to integrate numerals into the pre-training objective of NumGPT.
arXiv Detail & Related papers (2021-09-07T15:06:12Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
- When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning? [53.523017945443115]
We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples.
Our results do not depend on the training algorithm or the class of models used for learning.
arXiv Detail & Related papers (2020-12-11T15:25:14Z)
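As a rough illustration of the numeral systems compared in the Pythia scaling-behavior entry above, the sketch below shows how the same integer would be split into tokens under a base-$10$, base-$10^2$, or base-$10^3$ vocabulary. The grouping scheme is an assumption made for illustration, not the tokenizer used in that study.

```python
# A minimal sketch (an assumption for illustration, not the tokenizer from the
# Pythia scaling study) of what base-10, base-10^2, and base-10^3 numeral
# systems mean for tokenization: digits are grouped into chunks of 1, 2, or 3,
# so larger bases produce shorter sequences but need a larger numeral vocabulary.

def chunk_number(n: int, digits_per_token: int) -> list:
    """Split the decimal representation of n into fixed-size chunks,
    grouping from the least significant digit."""
    s = str(n)
    chunks = []
    while s:
        chunks.append(s[-digits_per_token:])
        s = s[:-digits_per_token]
    return list(reversed(chunks))

if __name__ == "__main__":
    n = 123456
    print(chunk_number(n, 1))  # base 10:   ['1', '2', '3', '4', '5', '6']
    print(chunk_number(n, 2))  # base 10^2: ['12', '34', '56']
    print(chunk_number(n, 3))  # base 10^3: ['123', '456']
```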
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.