Inverse scaling can become U-shaped
- URL: http://arxiv.org/abs/2211.02011v5
- Date: Wed, 24 May 2023 06:55:50 GMT
- Title: Inverse scaling can become U-shaped
- Authors: Jason Wei, Najoung Kim, Yi Tay, Quoc V. Le
- Abstract summary: Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks.
This paper takes a closer look at the eleven inverse scaling tasks identified by the Inverse Scaling Prize (McKenzie et al. 2022).
We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize.
- Score: 126.64521446943155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling up language models has been empirically shown to improve performance
on a wide range of downstream tasks. However, if we were to observe worse
performance as a function of scale ("inverse scaling") on certain tasks, this
would indicate that scaling can also encourage behaviors that are misaligned
with human preferences. The Inverse Scaling Prize (McKenzie et al. 2022)
identified eleven such inverse scaling tasks, evaluated on models of up to 280B
parameters and up to 500 zettaFLOPs of training compute. This paper takes a
closer look at these inverse scaling tasks. We evaluate models of up to 540B
parameters, trained on five times more compute than those evaluated in the
Inverse Scaling Prize. With this increased range of model sizes and training
compute, only four out of the eleven tasks remain inverse scaling. Six out of
the eleven tasks exhibit "U-shaped scaling", where performance decreases up to
a certain size, and then increases again up to the largest model evaluated (the
one remaining task displays positive scaling). In addition, we find that 1-shot
examples and chain-of-thought can help mitigate undesirable scaling patterns
even further. U-shaped scaling suggests that the inverse scaling trend observed
in McKenzie et al. (2022) may not continue to hold for larger models, which we
attribute to the presence of distractor tasks that only sufficiently large
models can avoid.
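To make the three trends in the abstract concrete, here is a minimal sketch (not from the paper) that labels a scaling curve as inverse, U-shaped, or positive from a few (parameter count, accuracy) pairs. The classify_scaling_trend helper and the data points are illustrative assumptions, not the paper's evaluation procedure.

```python
# Minimal illustrative sketch, not the paper's methodology: label a scaling
# curve as "inverse", "U-shaped", or "positive" from hypothetical
# (parameter_count, accuracy) pairs sorted by model size.

def classify_scaling_trend(points):
    """points: list of (params, accuracy) tuples, sorted by params."""
    accs = [acc for _, acc in points]
    worst = accs.index(min(accs))
    # U-shaped: performance dips at an intermediate scale, then recovers.
    if 0 < worst < len(accs) - 1 and accs[-1] > accs[worst]:
        return "U-shaped"
    # Inverse: the largest model does worse than the smallest.
    if accs[-1] < accs[0]:
        return "inverse"
    return "positive"

# Hypothetical accuracies (placeholder numbers, not results from the paper):
curve = [(8e9, 0.55), (62e9, 0.46), (540e9, 0.71)]
print(classify_scaling_trend(curve))  # -> U-shaped
```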
Related papers
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models [1.14179290793997]
Large language models (LLMs) have been shown to exhibit emergent abilities in some downstream tasks.
We observe U-shaped scaling for hard questions, and inverted-U scaling followed by steady improvement for easy questions.
We propose a simple yet effective pipeline, called Slice-and-Sandwich, to predict both the emergence threshold and model performance beyond the threshold.
arXiv Detail & Related papers (2024-10-02T16:03:49Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Inverse Scaling: When Bigger Isn't Better [80.42834197416444]
Large language models (LMs) show predictable improvements to overall loss with increased scale.
We present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale.
arXiv Detail & Related papers (2023-06-15T20:11:23Z)
- Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models [92.11542797811461]
We introduce NeQA, a dataset consisting of questions with negation.
We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling.
Decomposing NeQA into question answering (task 1) and negation understanding (task 2), we find that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point.
arXiv Detail & Related papers (2023-05-27T00:07:17Z)
- Emergent inabilities? Inverse scaling over the course of pretraining [0.6091702876917281]
We investigate whether, over the course of training, the performance of language models at specific tasks can decrease while general performance remains high.
We find that for two tasks from the Inverse Scaling Challenge - quote-repetition and redefine-math - this is indeed the case.
This highlights the importance of testing model performance at all relevant benchmarks any time they are trained on additional data, even if their overall performance improves.
arXiv Detail & Related papers (2023-05-24T03:42:43Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models (see the extrapolation sketch after this list).
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
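Two entries above (A Hitchhiker's Guide to Scaling Law Estimation, and Scaling Laws Under the Microscope) describe predicting larger-model performance by extrapolating from smaller models. The sketch below is a hedged illustration of that idea: it fits a simple power law, loss(N) ≈ a · N^(−α), to hypothetical small-model losses and extrapolates it. The data points and the single-term power-law form are assumptions made for the example, not the fitting procedures used in those papers.

```python
# Minimal sketch of power-law extrapolation (assumed example data, not taken
# from any of the papers above). Real scaling-law fits usually include an
# irreducible-loss term and more careful estimation than this log-log fit.
import numpy as np

# Hypothetical (parameter count, validation loss) pairs for small models.
params = np.array([1e8, 3e8, 1e9, 3e9])
losses = np.array([3.10, 2.85, 2.62, 2.41])

# Least-squares fit in log-log space: log(loss) = log(a) - alpha * log(N).
slope, intercept = np.polyfit(np.log(params), np.log(losses), deg=1)
a, alpha = np.exp(intercept), -slope

# Extrapolate to a larger (hypothetical) 70B-parameter model.
predicted = a * (70e9) ** (-alpha)
print(f"alpha = {alpha:.3f}, predicted loss at 70B params = {predicted:.2f}")
```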
This list is automatically generated from the titles and abstracts of the papers on this site.