Emergent inabilities? Inverse scaling over the course of pretraining
- URL: http://arxiv.org/abs/2305.14681v1
- Date: Wed, 24 May 2023 03:42:43 GMT
- Title: Emergent inabilities? Inverse scaling over the course of pretraining
- Authors: James A. Michaelov, Benjamin K. Bergen
- Abstract summary: We investigate whether, over the course of training, the performance of language models at specific tasks can decrease while general performance remains high.
We find that for two tasks from the Inverse Scaling Challenge - quote-repetition and redefine-math - this is indeed the case.
This highlights the importance of testing model performance at all relevant benchmarks any time they are trained on additional data, even if their overall performance improves.
- Score: 0.6091702876917281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Does inverse scaling only occur as a function of model parameter size, or can
it also occur over the course of training? We carry out an exploratory study
investigating whether, over the course of training on the language modeling
task, the performance of language models at specific tasks can decrease while
general performance remains high. We find that for two tasks from the Inverse
Scaling Challenge - quote-repetition and redefine-math - this is indeed the
case. Specifically, we find that for Pythia (Biderman et al., 2023) models with
a higher number of parameters, performance decreases over the course of
training at these two tasks, despite these models showing standard (positive)
scaling overall. This highlights the importance of testing model performance at
all relevant benchmarks any time they are trained on additional data, even if
their overall performance improves.
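The checkpoint-by-checkpoint evaluation described above can be reproduced in outline with the Pythia suite, whose intermediate training checkpoints are published as git revisions (e.g. "step3000") on the Hugging Face Hub. The sketch below is not the authors' code: it assumes the EleutherAI/pythia-1.4b checkpoints and a single hypothetical two-choice item standing in for a task like quote-repetition, and it scores each candidate continuation by summed log-probability so that task accuracy can be tracked across training steps.

```python
# Minimal sketch: track a two-choice task across Pythia training checkpoints.
# Model names and the "stepN" revision convention follow the public Pythia
# release; the EXAMPLES list is a hypothetical stand-in for a real task file.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

EXAMPLES = [  # (prompt, preferred continuation, dispreferred continuation)
    ("Repeat the quote exactly: 'All that glitters is not", " gold'", " gold.'"),
]

def completion_logprob(model, tokenizer, prompt, completion):
    """Summed log-probability the model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..T-1
    targets = full_ids[0, 1:]
    start = prompt_len - 1  # first completion token as a prediction target
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()

def task_accuracy(model_name, revision):
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
    model.eval()
    correct = sum(
        completion_logprob(model, tokenizer, prompt, good)
        > completion_logprob(model, tokenizer, prompt, bad)
        for prompt, good, bad in EXAMPLES
    )
    return correct / len(EXAMPLES)

# Inverse scaling over training would show accuracy falling at later
# checkpoints even though the model's overall language modeling improves.
for step in ["step3000", "step30000", "step143000"]:
    print(step, task_accuracy("EleutherAI/pythia-1.4b", step))
```

Scoring candidate continuations by summed log-probability is a standard zero-shot setup for multiple-choice evaluation; the actual Inverse Scaling Challenge data and metrics would replace EXAMPLES and the accuracy computation here.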
Related papers
- Establishing Task Scaling Laws via Compute-Efficient Model Ladders [123.8193940110293]
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting.
We leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance (a toy sketch of this two-step setup appears after this list).
arXiv Detail & Related papers (2024-12-05T18:21:49Z)
- Understanding Emergent Abilities of Language Models from the Loss Perspective [32.81782726603632]
We study emergent abilities through the lens of pre-training loss, instead of model size or training compute.
We find that a model exhibits emergent abilities on certain tasks regardless of the continuity of metrics.
This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses.
arXiv Detail & Related papers (2024-03-23T11:03:31Z)
- Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
arXiv Detail & Related papers (2023-06-15T20:11:23Z)
- Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models [92.11542797811461]
We introduce NeQA, a dataset consisting of questions with negation.
We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling.
Decomposing NeQA into its subtasks, we find that question answering (task 1) shows linear scaling, while negation understanding (task 2) shows sigmoid-shaped scaling with an emergent transition point.
arXiv Detail & Related papers (2023-05-27T00:07:17Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Inverse scaling can become U-shaped [126.64521446943155]
Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks.
This paper takes a closer look at tasks reported to show inverse scaling, i.e., worse performance with increased scale.
We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize.
arXiv Detail & Related papers (2022-11-03T17:26:44Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
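Referring back to the "Establishing Task Scaling Laws via Compute-Efficient Model Ladders" entry above, the two-step prediction it summarizes can be sketched as follows. This is a purely illustrative toy, not the paper's fitted laws: both functional forms and every constant are hypothetical placeholders; step 1 maps model size and data size to a task-specific loss, and step 2 maps that loss to task accuracy.

```python
# Toy two-step task-performance prediction (hypothetical forms and constants).
import math

def predicted_task_loss(n_params, n_tokens, E=1.0, A=400.0, alpha=0.3, B=600.0, beta=0.3):
    """Step 1: task-specific loss as a function of model size and data size."""
    return E + A / n_params**alpha + B / n_tokens**beta

def predicted_task_accuracy(task_loss, lo=0.25, hi=1.0, midpoint=1.5, slope=4.0):
    """Step 2: sigmoidal map from task loss to accuracy (lower loss -> higher accuracy)."""
    return lo + (hi - lo) / (1.0 + math.exp(slope * (task_loss - midpoint)))

# Chain the two steps for a hypothetical 1B-parameter model trained on 300B tokens.
loss = predicted_task_loss(1e9, 3e11)
print(f"predicted task loss {loss:.3f} -> predicted accuracy {predicted_task_accuracy(loss):.3f}")
```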