Positional Description Matters for Transformers Arithmetic
- URL: http://arxiv.org/abs/2311.14737v1
- Date: Wed, 22 Nov 2023 00:31:01 GMT
- Title: Positional Description Matters for Transformers Arithmetic
- Authors: Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, Yi Zhang
- Abstract summary: Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
- Score: 58.4739272381373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers, central to the successes in modern Natural Language Processing,
often falter on arithmetic tasks despite their vast capabilities, which paradoxically
include remarkable coding abilities. We observe that a crucial challenge is their naive
reliance on positional information to solve arithmetic problems with a small number of
digits, leading to poor performance on larger numbers. Herein, we delve deeper into the
role of positional encoding and propose several ways to fix the issue, either by modifying
the positional encoding directly or by modifying the representation of the arithmetic task
to leverage standard positional encoding differently. We investigate the value of these
modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in
addition, and (iii) addition in natural language context. For (i), we train a small model
on a small dataset (100M parameters and 300k samples) with remarkable aptitude at direct
(no scratchpad) 15-digit multiplication and essentially perfect accuracy up to 12 digits,
whereas usual training in this context would give a model failing at 4-digit multiplication.
In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii),
extrapolation from training on 10-digit numbers to testing on 12-digit numbers, whereas
usual training would yield no extrapolation; and for (iii), almost perfect accuracy up to
5 digits, whereas usual training would be correct only up to 3 digits (which is essentially
memorization with a training set of 120k samples).
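As a concrete illustration of the second route (changing the representation of the arithmetic task rather than the positional encoding itself), the sketch below reformats addition examples by zero-padding the operands to a fixed width and emitting the answer least-significant-digit first, so that each output digit depends on input digits at predictable relative positions. This is a generic sketch of the idea only: the helper name `format_addition_example`, the padding width, and the reversed answer order are illustrative assumptions, not necessarily the exact scheme used in this paper (the reversed order echoes the decoding-order idea in the "Reverse That Number!" entry in the related papers below).

```python
# Illustrative sketch: reformatting addition examples so that digit positions are
# explicit and aligned. This is one generic instance of "modifying the representation
# of the arithmetic task"; the padding width and least-significant-digit-first answer
# order are illustrative choices, not the paper's exact recipe.

def format_addition_example(a: int, b: int, width: int = 12) -> str:
    """Render `a + b = c` with zero-padded operands and a reversed answer string."""
    a_str = str(a).zfill(width)                 # fixed width: digit i always sits at position i
    b_str = str(b).zfill(width)
    c_rev = str(a + b).zfill(width + 1)[::-1]   # least significant digit first
    return f"{a_str}+{b_str}={c_rev}"

if __name__ == "__main__":
    print(format_addition_example(95, 4821))
    # -> 000000000095+000000004821=6194000000000
```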
Related papers
- Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia [55.23627698804683]
We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^2$ or $10^3$ system across training data scales.
We identify that base $100$ and base $1000$ systems struggle with token-level discernment and token-level operations.
arXiv Detail & Related papers (2024-09-25T22:08:31Z) - Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks [27.020990219204343]
Large language models (LLMs) can correctly and confidently predict the first digit of an n-digit by m-digit multiplication.
LLMs in practice often fail to correctly or confidently predict the last digit of an n-digit by m-digit multiplication.
We show that the latter task can be solved more robustly when the LLM is conditioned on all of the correct higher-order digits.
arXiv Detail & Related papers (2024-06-04T14:34:39Z) - Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks [32.81985604969825]
We show that Transformers fail to generalize over length on basic arithmetic tasks such as addition and multiplication.
A major reason behind this failure is the vast difference in structure between numbers and text.
We propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings.
arXiv Detail & Related papers (2024-06-04T02:00:07Z) - Transformers Can Do Arithmetic with the Right Embeddings [75.66545271398704]
We show how to improve the performance of transformers on arithmetic tasks.
We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance.
These gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
arXiv Detail & Related papers (2024-05-27T17:49:18Z) - Reverse That Number! Decoding Order Matters in Arithmetic Learning [49.5504492920404]
Our work introduces a novel strategy that reevaluates the digit order by prioritizing output from the least significant digit.
Compared to the previous state-of-the-art (SOTA) method, our findings reveal an overall improvement in accuracy while requiring only a third of the tokens typically used during training.
arXiv Detail & Related papers (2024-03-09T09:04:53Z) - GPT Can Solve Mathematical Problems Without a Calculator [24.114064917059565]
We show that a large language model can perform arithmetic operations with almost 100% accuracy and without data leakage.
We also demonstrate that our MathGLM, fine-tuned from GLM-10B, achieves performance similar to that of GPT-4 on a 5,000-sample Chinese math problem test set.
arXiv Detail & Related papers (2023-09-06T06:18:16Z) - Prompt Consistency for Zero-Shot Task Generalization [118.81196556175797]
In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance.
Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency.
Our approach outperforms the state-of-the-art zero-shot learner, T0, on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy.
arXiv Detail & Related papers (2022-04-29T19:18:37Z) - Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [62.51758040848735]
We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation.
ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance (a minimal sketch of this biasing appears after this list).
We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048.
arXiv Detail & Related papers (2021-08-27T17:35:06Z) - Investigating the Limitations of Transformers with Simple Arithmetic Tasks [10.23804850480924]
We find that how a number is represented in its surface form has a strong influence on the model's accuracy.
We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as the proper surface representation is used.
arXiv Detail & Related papers (2021-02-25T17:22:53Z)
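For reference, here is a minimal sketch of the ALiBi biasing described in the "Train Short, Test Long" entry above: no positional embeddings are added, and instead each attention head adds a negative, distance-proportional bias to its query-key scores before the softmax. The geometric slope schedule follows the published ALiBi recipe for a power-of-two head count; the NumPy helper `alibi_bias` below is an illustrative sketch, not code from that paper.

```python
# Illustrative sketch of ALiBi: instead of adding positional embeddings, add a
# negative, distance-proportional bias to the causal attention logits.
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) additive bias for causal attention."""
    # Head-specific slopes: 2^(-8/num_heads), 2^(-16/num_heads), ... (geometric sequence)
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # distance[i, j] = i - j for the positions a query at i may attend to (j <= i)
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]
    bias = -slopes[:, None, None] * distance[None, :, :]
    # Mask out future positions as usual for causal attention
    causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    bias = np.where(causal_mask, -np.inf, bias)
    return bias  # added to the query-key scores before the softmax

if __name__ == "__main__":
    print(alibi_bias(num_heads=8, seq_len=4)[0])
```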
This list is automatically generated from the titles and abstracts of the papers on this site.