Positional Description Matters for Transformers Arithmetic
- URL: http://arxiv.org/abs/2311.14737v1
- Date: Wed, 22 Nov 2023 00:31:01 GMT
- Title: Positional Description Matters for Transformers Arithmetic
- Authors: Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, Yi Zhang
- Abstract summary: Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
- Score: 58.4739272381373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers, central to the successes in modern Natural Language Processing,
often falter on arithmetic tasks despite their vast capabilities, which paradoxically
include remarkable coding abilities. We observe that a crucial challenge is their naive
reliance on positional information to solve arithmetic problems with a small number of
digits, leading to poor performance on larger numbers. Herein, we delve deeper into the
role of positional encoding and propose several ways to fix the issue, either by modifying
the positional encoding directly or by modifying the representation of the arithmetic task
to leverage standard positional encoding differently. We investigate the value of these
modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in
addition, and (iii) addition in natural language context. For (i), we train a small model
on a small dataset (100M parameters and 300k samples) with remarkable aptitude at direct
(no scratchpad) 15-digit multiplication and essentially perfect accuracy up to 12 digits,
whereas usual training in this context would give a model failing at 4-digit multiplication.
In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii),
extrapolation from training on 10-digit numbers to testing on 12-digit numbers, whereas
usual training would yield no extrapolation; and for (iii), almost perfect accuracy up to
5 digits, whereas usual training would be correct only up to 3 digits (which is essentially
memorization with a training set of 120k samples).
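As a concrete illustration of the second route (changing the representation of the arithmetic task rather than the positional encoding itself), the sketch below reformats addition examples by zero-padding the operands to a fixed width and emitting the answer least-significant-digit first, so that each output digit depends on input digits at predictable relative positions. This is a generic sketch of the idea only: the helper name `format_addition_example`, the padding width, and the reversed answer order are illustrative assumptions, not necessarily the exact scheme used in this paper (the reversed order echoes the decoding-order idea in the "Reverse That Number!" entry in the related papers below).

```python
# Illustrative sketch: reformatting addition examples so that digit positions are
# explicit and aligned. This is one generic instance of "modifying the representation
# of the arithmetic task"; the padding width and least-significant-digit-first answer
# order are illustrative choices, not the paper's exact recipe.

def format_addition_example(a: int, b: int, width: int = 12) -> str:
    """Render `a + b = c` with zero-padded operands and a reversed answer string."""
    a_str = str(a).zfill(width)                 # fixed width: digit i always sits at position i
    b_str = str(b).zfill(width)
    c_rev = str(a + b).zfill(width + 1)[::-1]   # least significant digit first
    return f"{a_str}+{b_str}={c_rev}"

if __name__ == "__main__":
    print(format_addition_example(95, 4821))
    # -> 000000000095+000000004821=6194000000000
```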
Related papers
- Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia [55.23627698804683]
We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^2$ or $10^3$ system across training data scales.
We identify that base $100$ and base $1000$ systems struggle with token-level discernment and token-level operations.
arXiv Detail & Related papers (2024-09-25T22:08:31Z) - Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks [27.020990219204343]
Large language models (LLMs) can correctly and confidently predict the first digit of an n-digit by m-digit multiplication.
LLMs in practice often fail to correctly or confidently predict the last digit of an n-digit by m-digit multiplication.
We show that the latter task can be solved more robustly when the LLM is conditioned on all of the correct higher-order digits.
arXiv Detail & Related papers (2024-06-04T14:34:39Z) - Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks [32.81985604969825]
We show that Transformers fail to generalize over length on basic arithmetic tasks such as addition and multiplication.
A major reason behind this failure is the vast difference in structure between numbers and text.
We propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings.
arXiv Detail & Related papers (2024-06-04T02:00:07Z) - Transformers Can Do Arithmetic with the Right Embeddings [75.66545271398704]
We show how to improve the performance of transformers on arithmetic tasks.
We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance.
These gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
arXiv Detail & Related papers (2024-05-27T17:49:18Z) - Reverse That Number! Decoding Order Matters in Arithmetic Learning [49.5504492920404]
Our work introduces a novel strategy that reevaluates the digit order by prioritizing output from the least significant digit.
Compared to the previous state-of-the-art (SOTA) method, our findings reveal an overall improvement in accuracy while requiring only a third of the tokens typically used during training.
arXiv Detail & Related papers (2024-03-09T09:04:53Z) - GPT Can Solve Mathematical Problems Without a Calculator [24.114064917059565]
We show that a large language model can perform arithmetic operations with almost 100% accuracy and without data leakage.
We also demonstrate that our MathGLM, fine-tuned from GLM-10B, achieves performance similar to that of GPT-4 on a 5,000-sample Chinese math problem test set.
arXiv Detail & Related papers (2023-09-06T06:18:16Z) - Prompt Consistency for Zero-Shot Task Generalization [118.81196556175797]
In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance.
Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency.
Our approach outperforms the state-of-the-art zero-shot learner, T0, on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy.
arXiv Detail & Related papers (2022-04-29T19:18:37Z) - Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [62.51758040848735]
We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation.
ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance (a minimal sketch of this biasing appears after this list).
We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048.
arXiv Detail & Related papers (2021-08-27T17:35:06Z) - Investigating the Limitations of Transformers with Simple Arithmetic Tasks [10.23804850480924]
We find that how a number is represented in its surface form has a strong influence on the model's accuracy.
We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as the proper surface representation is used.
arXiv Detail & Related papers (2021-02-25T17:22:53Z)
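For reference, here is a minimal sketch of the ALiBi biasing described in the "Train Short, Test Long" entry above: no positional embeddings are added, and instead each attention head adds a negative, distance-proportional bias to its query-key scores before the softmax. The geometric slope schedule follows the published ALiBi recipe for a power-of-two head count; the NumPy helper `alibi_bias` below is an illustrative sketch, not code from that paper.

```python
# Illustrative sketch of ALiBi: instead of adding positional embeddings, add a
# negative, distance-proportional bias to the causal attention logits.
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) additive bias for causal attention."""
    # Head-specific slopes: 2^(-8/num_heads), 2^(-16/num_heads), ... (geometric sequence)
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # distance[i, j] = i - j for the positions a query at i may attend to (j <= i)
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]
    bias = -slopes[:, None, None] * distance[None, :, :]
    # Mask out future positions as usual for causal attention
    causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    bias = np.where(causal_mask, -np.inf, bias)
    return bias  # added to the query-key scores before the softmax

if __name__ == "__main__":
    print(alibi_bias(num_heads=8, seq_len=4)[0])
```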
This list is automatically generated from the titles and abstracts of the papers on this site.