Teaching Arithmetic to Small Transformers
- URL: http://arxiv.org/abs/2307.03381v1
- Date: Fri, 7 Jul 2023 04:33:31 GMT
- Title: Teaching Arithmetic to Small Transformers
- Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
- Abstract summary: This study investigates how small transformers can efficiently learn arithmetic operations.
We first demonstrate that conventional training data is not the most effective for arithmetic learning.
We then train on chain-of-thought style data that includes intermediate step results.
- Score: 39.72665384986095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models like GPT-4 exhibit emergent capabilities across
general-purpose tasks, such as basic arithmetic, when trained on extensive text
data, even though these tasks are not explicitly encoded by the unsupervised,
next-token prediction objective. This study investigates how small
transformers, trained from random initialization, can efficiently learn
arithmetic operations such as addition, multiplication, and elementary
functions like square root, using the next-token prediction objective. We first
demonstrate that conventional training data is not the most effective for
arithmetic learning, and simple formatting changes can significantly improve
accuracy. This leads to sharp phase transitions as a function of training data
scale, which, in some cases, can be explained through connections to low-rank
matrix completion. Building on prior work, we then train on chain-of-thought
style data that includes intermediate step results. Even in the complete
absence of pretraining, this approach significantly and simultaneously improves
accuracy, sample complexity, and convergence speed. We also study the interplay
between arithmetic and text data during training and examine the effects of
few-shot prompting, pretraining, and model scale. Additionally, we discuss
length generalization challenges. Our work highlights the importance of
high-quality, instructive data that considers the particular characteristics of
the next-word prediction objective for rapidly eliciting arithmetic
capabilities.
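To make the abstract's point about data formatting concrete, the short Python sketch below serializes the same addition problem in three ways: a conventional left-to-right format, a reversed-output format in which result digits appear in the order their carries are produced, and a chain-of-thought style record with per-digit intermediate results. The exact templates, tokens, and level of scratchpad detail used in the paper may differ; the function names and string layouts here are illustrative assumptions.

    # Minimal sketch (illustrative, not the paper's exact templates) of three ways
    # to serialize one addition example for next-token-prediction training.

    def format_plain(a: int, b: int) -> str:
        # Conventional format: the result is written most-significant digit first.
        return f"{a}+{b}={a + b}"

    def format_reverse(a: int, b: int) -> str:
        # Reversed-output format: result digits are emitted least-significant
        # first, matching the order in which the carries are actually produced.
        return f"{a}+{b}={str(a + b)[::-1]}"

    def format_cot(a: int, b: int) -> str:
        # Chain-of-thought style record: one line per digit position, showing the
        # operand digits, the running carry, and the resulting output digit.
        da, db = str(a)[::-1], str(b)[::-1]
        lines, carry = [f"{a}+{b}:"], 0
        for i in range(max(len(da), len(db))):
            x = int(da[i]) if i < len(da) else 0
            y = int(db[i]) if i < len(db) else 0
            s = x + y + carry
            lines.append(f"{x}+{y}+{carry} -> digit {s % 10}, carry {s // 10}")
            carry = s // 10
        if carry:
            lines.append(f"leading carry {carry}")
        lines.append(f"answer {a + b}")
        return "\n".join(lines)

    if __name__ == "__main__":
        for fmt in (format_plain, format_reverse, format_cot):
            print(fmt(758, 467))
            print("---")

One intuition, consistent with the abstract's claim: the plain format asks the model to predict the most significant result digit before any carries have been resolved, whereas the reversed and chain-of-thought formats expose the computation in the order a next-token predictor can actually perform it.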
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency (a minimal sketch of such a correlation appears after this list).
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Arithmetic-Based Pretraining -- Improving Numeracy of Pretrained Language Models [67.48894919842576]
State-of-the-art pretrained language models tend to perform below their capabilities when applied out-of-the-box on tasks that require numeracy.
We propose a new extended pretraining approach called Arithmetic-Based Pretraining that addresses this gap in a single extended pretraining step.
Our experiments show the effectiveness of Arithmetic-Based Pretraining in three different tasks that require improved numeracy.
arXiv Detail & Related papers (2022-05-13T16:10:13Z)
- More data or more parameters? Investigating the effect of data structure on generalization [17.249712222764085]
Properties of the data affect the test error as a function of the number of training examples and the number of trainable parameters.
We show that noise in the labels and strong anisotropy of the input data play similar roles on the test error.
arXiv Detail & Related papers (2021-03-09T16:08:41Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes a model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
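Regarding the distributional memorization entry above (see the note there), the sketch below gives one hedged reading of that measurement: a rank correlation between a model's output probabilities and how often the corresponding spans occur in its pretraining data. The toy inputs and the choice of Spearman correlation are illustrative assumptions, not that paper's exact pipeline.

    # Hedged sketch of a distributional-memorization-style measurement: correlate
    # a model's output probabilities with how often the corresponding spans occur
    # in its pretraining data. The toy numbers and the Spearman rank correlation
    # are illustrative assumptions, not the paper's exact pipeline.
    import math

    def spearman(xs, ys):
        # Spearman rank correlation (no tie handling) to stay dependency-free.
        def ranks(vs):
            order = sorted(range(len(vs)), key=lambda i: vs[i])
            r = [0.0] * len(vs)
            for rank, i in enumerate(order):
                r[i] = float(rank)
            return r
        rx, ry = ranks(xs), ranks(ys)
        mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
        cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
        sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
        sy = math.sqrt(sum((b - my) ** 2 for b in ry))
        return cov / (sx * sy)

    # model_probs[i]: probability the model assigns to generated span i
    # corpus_freqs[i]: (pseudo-)count of that span in the pretraining corpus
    model_probs = [0.91, 0.40, 0.75, 0.12, 0.66]
    corpus_freqs = [120.0, 30.0, 80.0, 2.0, 15.0]
    print(f"rank correlation: {spearman(model_probs, corpus_freqs):.3f}")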