Inverse scaling can become U-shaped
- URL: http://arxiv.org/abs/2211.02011v5
- Date: Wed, 24 May 2023 06:55:50 GMT
- Title: Inverse scaling can become U-shaped
- Authors: Jason Wei, Najoung Kim, Yi Tay, Quoc V. Le
- Abstract summary: Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks.
This paper takes a closer look at these inverse scaling tasks.
We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize.
- Score: 126.64521446943155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling up language models has been empirically shown to improve performance
on a wide range of downstream tasks. However, if we were to observe worse
performance as a function of scale ("inverse scaling") on certain tasks, this
would indicate that scaling can also encourage behaviors that are misaligned
with human preferences. The Inverse Scaling Prize (McKenzie et al. 2022)
identified eleven such inverse scaling tasks, evaluated on models of up to 280B
parameters and up to 500 zettaFLOPs of training compute. This paper takes a
closer look at these inverse scaling tasks. We evaluate models of up to 540B
parameters, trained on five times more compute than those evaluated in the
Inverse Scaling Prize. With this increased range of model sizes and training
compute, only four out of the eleven tasks remain inverse scaling. Six out of
the eleven tasks exhibit "U-shaped scaling", where performance decreases up to
a certain size, and then increases again up to the largest model evaluated (the
one remaining task displays positive scaling). In addition, we find that 1-shot
examples and chain-of-thought can help mitigate undesirable scaling patterns
even further. U-shaped scaling suggests that the inverse scaling trend observed
in McKenzie et al. (2022) may not continue to hold for larger models, which we
attribute to the presence of distractor tasks that only sufficiently large
models can avoid.
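The abstract distinguishes three trends: positive scaling (performance improves with size), inverse scaling (it degrades), and U-shaped scaling (it dips and then recovers at the largest scales). A minimal sketch of how one might label a measured accuracy-versus-scale curve, using a simple where-is-the-minimum heuristic (the function name and the numbers below are illustrative, not the paper's data or method):

```python
def classify_scaling(accuracies):
    """Classify accuracy points ordered from smallest to largest model.

    Returns "positive", "inverse", "u-shaped", or "unclear" based on
    where the minimum falls and whether the largest model recovers.
    """
    first, last = accuracies[0], accuracies[-1]
    min_idx = accuracies.index(min(accuracies))
    if min_idx == 0 and last > first:
        return "positive"   # performance only improves with scale
    if min_idx == len(accuracies) - 1:
        return "inverse"    # performance keeps degrading with scale
    if 0 < min_idx < len(accuracies) - 1 and last > accuracies[min_idx]:
        return "u-shaped"   # dips at intermediate scale, then recovers
    return "unclear"

# Illustrative curves, one accuracy per model size (hypothetical numbers):
print(classify_scaling([0.50, 0.62, 0.71, 0.80]))  # positive
print(classify_scaling([0.70, 0.62, 0.55, 0.48]))  # inverse
print(classify_scaling([0.70, 0.55, 0.52, 0.78]))  # u-shaped
```

The heuristic makes the paper's point concrete: an evaluation truncated before the recovery (e.g. only the first three points of the last curve) would be misread as inverse scaling.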
Related papers
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 80 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
arXiv Detail & Related papers (2023-06-15T20:11:23Z)
- Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models [92.11542797811461]
We introduce NeQA, a dataset consisting of questions with negation.
We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling.
We find that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point.
arXiv Detail & Related papers (2023-05-27T00:07:17Z)
- Emergent inabilities? Inverse scaling over the course of pretraining [0.6091702876917281]
We investigate whether, over the course of training, the performance of language models at specific tasks can decrease while general performance remains high.
We find that for two tasks from the Inverse Scaling Challenge - quote-repetition and redefine-math - this is indeed the case.
This highlights the importance of testing model performance at all relevant benchmarks any time they are trained on additional data, even if their overall performance improves.
arXiv Detail & Related papers (2023-05-24T03:42:43Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
- Scaling Laws for Acoustic Models [7.906034575114518]
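Several of the entries above use scaling laws to extrapolate from small models to larger ones. A minimal sketch of the standard recipe, assuming a power-law form loss ≈ a * N^(-b) and fitting it by linear regression in log-log space (the data here are synthetic, not taken from any of the papers):

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(N)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - slope * mx)
    return a, -slope  # so that loss ≈ a * N ** (-b)

# Synthetic losses generated exactly from 5.0 * N**-0.07:
sizes = [1e8, 1e9, 1e10]
losses = [5.0 * n ** -0.07 for n in sizes]
a, b = fit_power_law(sizes, losses)
predicted = a * (1e11) ** -b  # extrapolated loss for a 100B-parameter model
```

Because the synthetic data follow the power law exactly, the fit recovers a ≈ 5.0 and b ≈ 0.07; on real benchmark accuracies this extrapolation is exactly what inverse and U-shaped scaling can break.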
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.