Emergent inabilities? Inverse scaling over the course of pretraining
- URL: http://arxiv.org/abs/2305.14681v1
- Date: Wed, 24 May 2023 03:42:43 GMT
- Title: Emergent inabilities? Inverse scaling over the course of pretraining
- Authors: James A. Michaelov, Benjamin K. Bergen
- Abstract summary: We investigate whether, over the course of training, the performance of language models at specific tasks can decrease while general performance remains high.
We find that for two tasks from the Inverse Scaling Challenge - quote-repetition and redefine-math - this is indeed the case.
This highlights the importance of testing model performance at all relevant benchmarks any time they are trained on additional data, even if their overall performance improves.
- Score: 0.6091702876917281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Does inverse scaling only occur as a function of model parameter size, or can
it also occur over the course of training? We carry out an exploratory study
investigating whether, over the course of training on the language modeling
task, the performance of language models at specific tasks can decrease while
general performance remains high. We find that for two tasks from the Inverse
Scaling Challenge - quote-repetition and redefine-math - this is indeed the
case. Specifically, we find that for Pythia (Biderman et al., 2023) models with
a higher number of parameters, performance decreases over the course of
training at these two tasks, despite these models showing standard (positive)
scaling overall. This highlights the importance of testing model performance at
all relevant benchmarks any time they are trained on additional data, even if
their overall performance improves.
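The study relies on probing checkpoints saved over the course of pretraining. Below is a minimal sketch of that kind of probe, assuming the publicly released Pythia checkpoints on Hugging Face (exposed as revisions such as `step143000`) and an illustrative item in the style of the redefine-math task; the model name, step list, and item are assumptions for illustration, not the authors' evaluation code.

```python
# Sketch: score a multiple-choice inverse-scaling-style item at several
# Pythia pretraining checkpoints. Model, steps, and the example item are
# illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-1.4b"          # Pythia repos expose per-step revisions
STEPS = [1000, 36000, 71000, 143000]      # illustrative subset of training steps

prompt = "Redefine pi as 462. Q: What is the first digit of pi? A:"
options = [" 4", " 3"]   # " 4" follows the redefinition, " 3" follows prior knowledge
target = " 4"

def option_logprob(model, tokenizer, prompt, option):
    """Sum of log-probabilities of the option tokens, conditioned on the prompt.
    Assumes tokenizing prompt+option keeps the prompt tokenization as a prefix."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    n_prompt = prompt_ids.shape[1]
    option_ids = full_ids[0, n_prompt:]
    # Prediction for token t comes from position t-1, hence the shifted slice.
    return logprobs[0, n_prompt - 1 : full_ids.shape[1] - 1].gather(
        1, option_ids.unsqueeze(1)
    ).sum().item()

for step in STEPS:
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=f"step{step}").eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL, revision=f"step{step}")
    scores = {o: option_logprob(model, tokenizer, prompt, o) for o in options}
    chosen = max(scores, key=scores.get)
    print(f"step {step:>6}: chose {chosen!r} (correct: {chosen == target})")
```

Accuracy over many such items, tracked across checkpoints, is the kind of curve that would reveal the inverse-scaling-over-training trend the paper reports.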
Related papers
- How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics [17.086867242274813]
We analyse how performance develops as a function of model characteristics like number of parameters, or type of training.
We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket.
We also find a certain degree of unpredictability in performance across access methods, possibly due to unexposed sampling parameters.
arXiv Detail & Related papers (2024-06-20T07:17:09Z)
- Understanding Emergent Abilities of Language Models from the Loss Perspective [32.81782726603632]
We study emergent abilities through the lens of pre-training loss, instead of model size or training compute.
We discover that a model exhibits emergent abilities on certain tasks when its pre-training loss falls below a specific threshold.
This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses.
arXiv Detail & Related papers (2024-03-23T11:03:31Z)
- Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
arXiv Detail & Related papers (2023-06-15T20:11:23Z)
- Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models [92.11542797811461]
We introduce NeQA, a dataset consisting of questions with negation.
We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling.
Decomposing NeQA into question answering (task 1) and negation understanding (task 2), we find that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point.
arXiv Detail & Related papers (2023-05-27T00:07:17Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Inverse scaling can become U-shaped [126.64521446943155]
Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks.
This paper takes a closer look at the inverse scaling tasks identified in the Inverse Scaling Prize.
We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize.
arXiv Detail & Related papers (2022-11-03T17:26:44Z)
- Numerical reasoning in machine reading comprehension tasks: are we there yet? [79.07883990966077]
Numerical reasoning based machine reading comprehension is a task that involves reading comprehension along with using arithmetic operations such as addition, subtraction, sorting, and counting.
The DROP benchmark is a recent dataset that has inspired the design of NLP models aimed at solving this task.
The current standings of these models on the DROP leaderboard, under standard metrics, suggest that they have achieved near-human performance.
arXiv Detail & Related papers (2021-09-16T20:13:56Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
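The last related paper above contrasts full fine-tuning with lighter-weight adaptation such as prefix-tuning. As a concrete point of reference only, here is a minimal prefix-tuning setup using the Hugging Face `peft` library; the base model and hyperparameters are illustrative assumptions, and this is a stand-in rather than that paper's implementation.

```python
# Minimal prefix-tuning setup with the `peft` library (a stand-in for the
# adaptation methods compared in the paper, not its actual code).
from peft import PrefixTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # illustrative base model

model = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Only the prepended "virtual token" key/value prefixes receive gradients;
# the pretrained weights stay frozen.
config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically a small fraction of all weights
```

Because the pretrained weights are untouched, such methods preserve the base model's knowledge while adapting to the task, which is consistent with the better generalization to unseen answers reported in that paper.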