Length Generalization in Arithmetic Transformers
- URL: http://arxiv.org/abs/2306.15400v1
- Date: Tue, 27 Jun 2023 11:53:25 GMT
- Title: Length Generalization in Arithmetic Transformers
- Authors: Samy Jelassi, Stéphane d'Ascoli, Carles Domingo-Enrich, Yuhuai Wu, Yuanzhi Li, François Charton
- Abstract summary: We show how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training.
We propose train set priming: adding a few ($10$ to $50$) long sequences to the training set.
We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35 \times 3$ examples.
- Score: 41.62455986786115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We examine how transformers cope with two challenges: learning basic integer
arithmetic, and generalizing to longer sequences than seen during training. We
find that relative position embeddings enable length generalization for simple
tasks, such as addition: models trained on $5$-digit numbers can perform
$15$-digit sums. However, this method fails for multiplication, and we propose
train set priming: adding a few ($10$ to $50$) long sequences to the training
set. We show that priming allows models trained on $5$-digit $\times$ $3$-digit
multiplications to generalize to $35\times 3$ examples. We also show that
models can be primed for different generalization lengths, and that the priming
sample size scales as the logarithm of the training set size. Finally, we
discuss potential applications of priming beyond arithmetic.
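As a concrete illustration of the train set priming recipe described in the abstract, the following is a minimal sketch that builds a mostly $5 \times 3$-digit multiplication training set and mixes in a handful of long $35 \times 3$-digit priming examples. The digit-separated string format, function names, and set sizes are illustrative assumptions, not the authors' released data pipeline.

```python
import random

def mult_example(n_digits_a: int, n_digits_b: int) -> str:
    """Sample one multiplication example, written digit by digit as 'a * b = c'."""
    a = random.randint(10 ** (n_digits_a - 1), 10 ** n_digits_a - 1)
    b = random.randint(10 ** (n_digits_b - 1), 10 ** n_digits_b - 1)
    return f"{' '.join(str(a))} * {' '.join(str(b))} = {' '.join(str(a * b))}"

def build_primed_train_set(n_train: int = 100_000, n_prime: int = 50) -> list[str]:
    """Mostly 5x3-digit products, plus a few long 35x3-digit priming examples.

    Per the abstract, n_prime can stay small (10 to 50) and only needs to grow
    roughly logarithmically with the training set size n_train.
    """
    short_examples = [mult_example(5, 3) for _ in range(n_train)]
    long_examples = [mult_example(35, 3) for _ in range(n_prime)]
    data = short_examples + long_examples
    random.shuffle(data)
    return data

train_set = build_primed_train_set()
```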
Related papers
- Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count [19.148785141454642]
Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training.
In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers.
arXiv Detail & Related papers (2024-10-21T08:49:51Z) - Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia [55.23627698804683]
We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scales.
We identify that base $100$ and base $1000$ systems struggle with token-level discernment and token-level operations.
arXiv Detail & Related papers (2024-09-25T22:08:31Z) - Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z) - Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
arXiv Detail & Related papers (2023-11-22T00:31:01Z) - Improving Length-Generalization in Transformers via Task Hinting [42.95479331339189]
The performance of a transformer model trained on tasks up to a certain length drops sharply when applied to longer instances of the same problem.
This work proposes an approach based on task hinting towards addressing length generalization.
arXiv Detail & Related papers (2023-10-01T16:57:40Z) - Learning the greatest common divisor: explaining transformer predictions [8.430481660019451]
The predictions of small transformers can be fully characterized by looking at model inputs and outputs.
The model learns a list $\mathcal{D}$ of integers, products of divisors of the base used to represent integers and small primes, and predicts the largest element of $\mathcal{D}$ that divides both inputs (a minimal sketch of this rule appears after the related papers list).
arXiv Detail & Related papers (2023-08-29T19:38:41Z) - Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
arXiv Detail & Related papers (2021-09-17T17:50:39Z) - Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices [65.4388266814055]
We replace 1x1-convolutions in 1D time-channel separable convolutions with constant, sparse random ternary matrices with weights in $\{-1, 0, +1\}$ (see the sketch after this list).
For command recognition on Google Speech Commands v1, we improve the state-of-the-art accuracy from $97.21\%$ to $97.41\%$ at the same network size.
For speech recognition on Librispeech, we halve the number of weights to be trained while sacrificing only about $1\%$ of the floating-point baseline's word error rate.
arXiv Detail & Related papers (2021-03-31T15:09:20Z) - Investigating the Limitations of the Transformers with Simple Arithmetic Tasks [10.23804850480924]
We find that how a number is represented in its surface form has a strong influence on the model's accuracy.
We conclude that modern pretrained language models can easily learn arithmetic from very few examples.
arXiv Detail & Related papers (2021-02-25T17:22:53Z)
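The prediction rule summarized in the "Learning the greatest common divisor" entry above can be spelled out directly. Below is a minimal sketch of that characterization, not code from the paper; the representation base, the list of small primes, and the candidate cutoff are assumed placeholders.

```python
from itertools import product
from math import gcd

def build_candidate_list(base: int = 10,
                         small_primes: tuple[int, ...] = (2, 3, 5),
                         max_val: int = 10_000) -> list[int]:
    """Candidate list D: products of divisors of the base and of small primes."""
    base_divisors = [d for d in range(1, base + 1) if base % d == 0]
    candidates = set()
    for d in base_divisors:
        for exps in product(range(8), repeat=len(small_primes)):
            v = d
            for p, e in zip(small_primes, exps):
                v *= p ** e
            if v <= max_val:
                candidates.add(v)
    return sorted(candidates)

def predict_gcd(a: int, b: int, D: list[int]) -> int:
    """Predict the largest element of D that divides both inputs."""
    g = gcd(a, b)
    return max(d for d in D if g % d == 0)

D = build_candidate_list()
# True gcd(84, 126) is 42 = 2 * 3 * 7; the rule predicts 6, the largest element of D dividing 42.
print(predict_gcd(84, 126))
```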
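For the "Compressing 1D Time-Channel Separable Convolutions" entry, the core idea of replacing learned 1x1 channel-mixing convolutions with fixed, sparse random ternary matrices over $\{-1, 0, +1\}$ might be sketched as follows in PyTorch. The module name, sparsity level, and shapes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SparseRandomTernary1x1(nn.Module):
    """A fixed (untrained) channel-mixing layer standing in for a 1x1 convolution."""

    def __init__(self, in_channels: int, out_channels: int, density: float = 0.1):
        super().__init__()
        # Each entry is +1 or -1 with probability density / 2, and 0 otherwise.
        mask = (torch.rand(out_channels, in_channels) < density).float()
        signs = torch.randint(0, 2, (out_channels, in_channels)).float() * 2.0 - 1.0
        # Registered as a buffer, not a Parameter: the matrix stays constant during training.
        self.register_buffer("weight", mask * signs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, time) -> (batch, out_channels, time)
        return torch.einsum("oc,bct->bot", self.weight, x)

# Example: mix 256 channels into 128 inside a time-channel separable block.
layer = SparseRandomTernary1x1(256, 128)
y = layer(torch.randn(4, 256, 100))
```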