Emergent inabilities? Inverse scaling over the course of pretraining
- URL: http://arxiv.org/abs/2305.14681v1
- Date: Wed, 24 May 2023 03:42:43 GMT
- Title: Emergent inabilities? Inverse scaling over the course of pretraining
- Authors: James A. Michaelov, Benjamin K. Bergen
- Abstract summary: We investigate whether, over the course of training, the performance of language models at specific tasks can decrease while general performance remains high.
We find that for two tasks from the Inverse Scaling Challenge - quote-repetition and redefine-math - this is indeed the case.
This highlights the importance of testing model performance on all relevant benchmarks whenever models are trained on additional data, even if their overall performance improves.
- Score: 0.6091702876917281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Does inverse scaling only occur as a function of model parameter size, or can
it also occur over the course of training? We carry out an exploratory study
investigating whether, over the course of training on the language modeling
task, the performance of language models at specific tasks can decrease while
general performance remains high. We find that for two tasks from the Inverse
Scaling Challenge - quote-repetition and redefine-math - this is indeed the
case. Specifically, we find that for Pythia (Biderman et al., 2023) models with
a higher number of parameters, performance decreases over the course of
training at these two tasks, despite these models showing standard (positive)
scaling overall. This highlights the importance of testing model performance at
all relevant benchmarks any time they are trained on additional data, even if
their overall performance improves.
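
The check the abstract recommends, re-testing each benchmark as training proceeds, can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the function names, the use of a least-squares slope against log training step, and all numbers in the usage example are assumptions made up for this sketch and are not results from the paper.

```python
import math


def trend_slope(steps, scores):
    """Least-squares slope of score against log10(training step)."""
    xs = [math.log10(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def flag_inverse_scaling(steps, task_scores, overall_scores):
    """Return tasks whose score trends down while the overall score trends up."""
    if trend_slope(steps, overall_scores) <= 0:
        return []  # overall performance is not improving; nothing to contrast
    return sorted(
        name
        for name, scores in task_scores.items()
        if trend_slope(steps, scores) < 0
    )


# Hypothetical checkpoint data (illustrative values only, not from the paper).
steps = [1000, 10000, 100000]
task_scores = {
    "quote-repetition": [0.9, 0.8, 0.6],
    "other-task": [0.3, 0.4, 0.5],
}
overall_scores = [0.4, 0.5, 0.6]

flagged = flag_inverse_scaling(steps, task_scores, overall_scores)
print(flagged)
```

A slope test is only one possible criterion; a rank correlation or a comparison of first and last checkpoints would serve the same purpose.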
 
      
Related papers
- Establishing Task Scaling Laws via Compute-Efficient Model Ladders [123.8193940110293]
 We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting.
We leverage a two-step prediction approach: first use model and data size to predict a task-specific loss, and then use this task loss to predict task performance.
 arXiv (2024-12-05T18:21:49Z)
- How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics [17.086867242274813]
 We analyse how performance develops as a function of model characteristics like number of parameters, or type of training.
We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket.
We also find a certain degree of unpredictability in performance across access methods, possibly due to unexposed sampling parameters.
 arXiv (2024-06-20T07:17:09Z)
- Understanding Emergent Abilities of Language Models from the Loss Perspective [32.81782726603632]
 We study emergent abilities through the lens of pre-training loss, instead of model size or training compute.
We discover that a model exhibits emergent abilities on certain tasks when its pre-training loss falls below a specific threshold.
This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses.
 arXiv (2024-03-23T11:03:31Z)
- Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
 Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
 arXiv (2023-06-15T20:11:23Z)
- Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models [92.11542797811461]
 We introduce NeQA, a dataset consisting of questions with negation.
We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling.
Decomposing the task into question answering (task 1) and negation understanding (task 2), we find that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point.
 arXiv (2023-05-27T00:07:17Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
 Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
 arXiv (2022-12-19T19:16:29Z)
- Inverse scaling can become U-shaped [126.64521446943155]
 Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks.
This paper takes a closer look at the inverse scaling tasks identified by the Inverse Scaling Prize, on which performance worsens with scale.
We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize.
 arXiv (2022-11-03T17:26:44Z)
- Numerical reasoning in machine reading comprehension tasks: are we there yet? [79.07883990966077]
 Numerical reasoning based machine reading comprehension is a task that involves reading comprehension along with using arithmetic operations such as addition, subtraction, sorting, and counting.
The DROP benchmark is a recent dataset that has inspired the design of NLP models aimed at solving this task.
The current standings of these models in the DROP leaderboard, over standard metrics, suggest that the models have achieved near-human performance.
 arXiv (2021-09-16T20:13:56Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
 We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
 arXiv (2021-09-07T03:13:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     