Teaching Arithmetic to Small Transformers
- URL: http://arxiv.org/abs/2307.03381v1
- Date: Fri, 7 Jul 2023 04:33:31 GMT
- Title: Teaching Arithmetic to Small Transformers
- Authors: Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
- Abstract summary: This study investigates how small transformers can efficiently learn arithmetic operations.
We first demonstrate that conventional training data is not the most effective for arithmetic learning.
We then train on chain-of-thought style data that includes intermediate step results.
- Score: 39.72665384986095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models like GPT-4 exhibit emergent capabilities across
general-purpose tasks, such as basic arithmetic, when trained on extensive text
data, even though these tasks are not explicitly encoded by the unsupervised,
next-token prediction objective. This study investigates how small
transformers, trained from random initialization, can efficiently learn
arithmetic operations such as addition, multiplication, and elementary
functions like square root, using the next-token prediction objective. We first
demonstrate that conventional training data is not the most effective for
arithmetic learning, and simple formatting changes can significantly improve
accuracy. This leads to sharp phase transitions as a function of training data
scale, which, in some cases, can be explained through connections to low-rank
matrix completion. Building on prior work, we then train on chain-of-thought
style data that includes intermediate step results. Even in the complete
absence of pretraining, this approach significantly and simultaneously improves
accuracy, sample complexity, and convergence speed. We also study the interplay
between arithmetic and text data during training and examine the effects of
few-shot prompting, pretraining, and model scale. Additionally, we discuss
length generalization challenges. Our work highlights the importance of
high-quality, instructive data that considers the particular characteristics of
the next-word prediction objective for rapidly eliciting arithmetic
capabilities.
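To make the abstract's point about data formatting concrete, the short Python sketch below serializes the same addition problem in three ways: a conventional left-to-right format, a reversed-output format in which result digits appear in the order their carries are produced, and a chain-of-thought style record with per-digit intermediate results. The exact templates, tokens, and level of scratchpad detail used in the paper may differ; the function names and string layouts here are illustrative assumptions.

    # Minimal sketch (illustrative, not the paper's exact templates) of three ways
    # to serialize one addition example for next-token-prediction training.

    def format_plain(a: int, b: int) -> str:
        # Conventional format: the result is written most-significant digit first.
        return f"{a}+{b}={a + b}"

    def format_reverse(a: int, b: int) -> str:
        # Reversed-output format: result digits are emitted least-significant
        # first, matching the order in which the carries are actually produced.
        return f"{a}+{b}={str(a + b)[::-1]}"

    def format_cot(a: int, b: int) -> str:
        # Chain-of-thought style record: one line per digit position, showing the
        # operand digits, the running carry, and the resulting output digit.
        da, db = str(a)[::-1], str(b)[::-1]
        lines, carry = [f"{a}+{b}:"], 0
        for i in range(max(len(da), len(db))):
            x = int(da[i]) if i < len(da) else 0
            y = int(db[i]) if i < len(db) else 0
            s = x + y + carry
            lines.append(f"{x}+{y}+{carry} -> digit {s % 10}, carry {s // 10}")
            carry = s // 10
        if carry:
            lines.append(f"leading carry {carry}")
        lines.append(f"answer {a + b}")
        return "\n".join(lines)

    if __name__ == "__main__":
        for fmt in (format_plain, format_reverse, format_cot):
            print(fmt(758, 467))
            print("---")

One intuition, consistent with the abstract's claim: the plain format asks the model to predict the most significant result digit before any carries have been resolved, whereas the reversed and chain-of-thought formats expose the computation in the order a next-token predictor can actually perform it.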
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency (a minimal sketch of such a correlation appears after this list).
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Arithmetic-Based Pretraining -- Improving Numeracy of Pretrained Language Models [67.48894919842576]
State-of-the-art pretrained language models tend to perform below their capabilities when applied out-of-the-box on tasks that require numeracy.
We propose a new extended pretraining approach called Arithmetic-Based Pretraining that addresses this gap in a single extended pretraining step.
Our experiments show the effectiveness of Arithmetic-Based Pretraining in three different tasks that require improved numeracy.
arXiv Detail & Related papers (2022-05-13T16:10:13Z)
- More data or more parameters? Investigating the effect of data structure on generalization [17.249712222764085]
Properties of the data affect the test error as a function of the number of training examples and the number of trainable parameters.
We show that noise in the labels and strong anisotropy of the input data play similar roles on the test error.
arXiv Detail & Related papers (2021-03-09T16:08:41Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes a model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
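Regarding the distributional memorization entry above (see the note there), the sketch below gives one hedged reading of that measurement: a rank correlation between a model's output probabilities and how often the corresponding spans occur in its pretraining data. The toy inputs and the choice of Spearman correlation are illustrative assumptions, not that paper's exact pipeline.

    # Hedged sketch of a distributional-memorization-style measurement: correlate
    # a model's output probabilities with how often the corresponding spans occur
    # in its pretraining data. The toy numbers and the Spearman rank correlation
    # are illustrative assumptions, not the paper's exact pipeline.
    import math

    def spearman(xs, ys):
        # Spearman rank correlation (no tie handling) to stay dependency-free.
        def ranks(vs):
            order = sorted(range(len(vs)), key=lambda i: vs[i])
            r = [0.0] * len(vs)
            for rank, i in enumerate(order):
                r[i] = float(rank)
            return r
        rx, ry = ranks(xs), ranks(ys)
        mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
        cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
        sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
        sy = math.sqrt(sum((b - my) ** 2 for b in ry))
        return cov / (sx * sy)

    # model_probs[i]: probability the model assigns to generated span i
    # corpus_freqs[i]: (pseudo-)count of that span in the pretraining corpus
    model_probs = [0.91, 0.40, 0.75, 0.12, 0.66]
    corpus_freqs = [120.0, 30.0, 80.0, 2.0, 15.0]
    print(f"rank correlation: {spearman(model_probs, corpus_freqs):.3f}")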