Understanding Emergent Abilities of Language Models from the Loss Perspective
- URL: http://arxiv.org/abs/2403.15796v2
- Date: Sat, 30 Mar 2024 09:55:12 GMT
- Title: Understanding Emergent Abilities of Language Models from the Loss Perspective
- Authors: Zhengxiao Du, Aohan Zeng, Yuxiao Dong, Jie Tang
- Abstract summary: We study emergent abilities through the lens of pre-training loss, instead of model size or training compute.
We discover that a model exhibits emergent abilities on certain tasks when its pre-training loss falls below a specific threshold.
This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses.
- Score: 32.81782726603632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have called into question the belief that emergent abilities in language models are exclusive to large models. This skepticism arises from two observations: 1) smaller models can also exhibit high performance on emergent abilities, and 2) there is doubt about the discontinuous metrics used to measure these abilities. In this paper, we propose to study emergent abilities through the lens of pre-training loss, instead of model size or training compute. We demonstrate that models with the same pre-training loss, but different model and data sizes, achieve the same performance on various downstream tasks. We also discover that a model exhibits emergent abilities on certain tasks -- regardless of the continuity of metrics -- when its pre-training loss falls below a specific threshold. Before reaching this threshold, its performance remains at the level of random guessing. This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses, highlighting that these abilities cannot be predicted by merely extrapolating the performance trends of models with higher pre-training losses.
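A minimal Python sketch of the threshold view described in the abstract. The threshold, random-guess baseline, and post-threshold slope are hypothetical illustration values, not numbers from the paper:
```python
import numpy as np

# Hypothetical values for illustration only.
LOSS_THRESHOLD = 2.2   # assumed emergence threshold (nats/token)
RANDOM_GUESS = 0.25    # e.g. 4-way multiple choice baseline

def emergent_accuracy(pretraining_loss: float) -> float:
    """Piecewise performance curve implied by the loss-threshold view."""
    if pretraining_loss >= LOSS_THRESHOLD:
        return RANDOM_GUESS  # above the threshold: random-guess level
    # Below the threshold, accuracy rises as the loss keeps falling.
    gain = (LOSS_THRESHOLD - pretraining_loss) * 0.5  # hypothetical slope
    return min(1.0, RANDOM_GUESS + gain)

for loss in np.linspace(3.0, 1.0, 9):
    print(f"loss={loss:.2f}  acc={emergent_accuracy(loss):.2f}")
```
Extrapolating the flat, above-threshold part of such a curve predicts nothing about the rise below the threshold, which is the paper's point about predictability.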
Related papers
- Effects of Scale on Language Model Robustness [7.725206196110384]
We show that adversarially trained larger models generalize faster and better than smaller models to modified attacks not seen during training.
We also analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others.
arXiv Detail & Related papers (2024-07-25T17:26:41Z)
- Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
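A minimal difference-in-differences sketch of this idea (the arrays are made-up per-instance losses; the paper's actual estimator is more careful):
```python
import numpy as np

# Loss on an instance across a training step, for runs that train on the
# instance ("treated") versus matched runs that never see it ("control").
treated_before = np.array([3.1, 2.9, 3.0])   # before training on the instance
treated_after  = np.array([1.2, 1.4, 1.1])   # after training on it
control_before = np.array([3.0, 3.1, 2.9])   # control runs, same step, before
control_after  = np.array([2.7, 2.8, 2.6])   # generic improvement only

# DiD subtracts the generic improvement (control) from the improvement on
# the trained instance (treated), isolating the causal memorisation effect.
did = (treated_after.mean() - treated_before.mean()) \
    - (control_after.mean() - control_before.mean())
print(f"estimated memorisation effect on loss: {did:.2f}")
```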
arXiv Detail & Related papers (2024-06-06T17:59:09Z)
- Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck [11.416426888383873]
We find that smaller models can suffer from saturation, characterized as a drop in performance at some advanced point in training followed by a plateau.
This can be explained by a mismatch between the hidden dimension of smaller models and the high rank of the target contextual probability distribution.
We measure the effect of the softmax bottleneck in various settings and find that models with fewer than 1000 hidden dimensions tend to adopt degenerate latent representations in late pretraining.
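A toy numpy illustration of the softmax bottleneck, with hypothetical sizes: the logit matrix a model with hidden dimension d produces over a set of contexts has rank at most d, so it cannot exactly match a higher-rank target log-probability matrix:
```python
import numpy as np

n_contexts, vocab, d = 64, 32, 8          # made-up sizes with d << vocab
H = np.random.randn(n_contexts, d)        # hidden states, one per context
W = np.random.randn(vocab, d)             # output (unembedding) matrix

logits = H @ W.T                          # shape (n_contexts, vocab)
print(np.linalg.matrix_rank(logits))      # at most d = 8, regardless of vocab

# A generic target distribution is full rank and cannot be represented exactly.
target_logp = np.random.randn(n_contexts, vocab)
print(np.linalg.matrix_rank(target_logp)) # typically min(64, 32) = 32
```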
arXiv Detail & Related papers (2024-04-11T11:10:36Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
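One possible transfer mechanism is plain knowledge distillation; the sketch below is an assumption for illustration, not the paper's method, and matches student logits to teacher logits via a temperature-softened KL divergence:
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t * t

# Hypothetical usage on random logits:
student, teacher = torch.randn(4, 10), torch.randn(4, 10)
print(distillation_loss(student, teacher).item())
```
Naive distillation on all inputs risks the performance degradation the paper is concerned with, since the teacher may be worse than the student on some data.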
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- Small-scale proxies for large-scale Transformer training instabilities [69.36381318171338]
We seek ways to reproduce and study training stability and instability at smaller scales.
By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates.
We study methods such as warm-up, weight decay, and $\mu$Param to train small models that achieve similar losses across orders of magnitude of learning rate variation.
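A rough sketch of the $\mu$Param idea as it relates to learning-rate transfer, assuming a simple 1/width rule for hidden weights (the actual prescription is per-parameter-type and more detailed):
```python
# Hypothetical base configuration tuned at a small proxy width.
BASE_WIDTH = 256
BASE_LR = 1e-2

def hidden_lr(width: int) -> float:
    """Scale the hidden-layer learning rate inversely with model width,
    so a learning rate tuned at BASE_WIDTH transfers to wider models."""
    return BASE_LR * BASE_WIDTH / width

for width in (256, 1024, 4096, 16384):
    print(f"width={width:5d}  lr={hidden_lr(width):.2e}")
```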
arXiv Detail & Related papers (2023-09-25T17:48:51Z)
- Are Emergent Abilities of Large Language Models a Mirage? [9.683505038585988]
Recent work claims that large language models display emergent abilities: abilities that are present in larger-scale models but not in smaller-scale models.
Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, emergent abilities appear due to the researcher's choice of metric.
Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous, predictable changes in model performance.
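A minimal sketch of this metric argument: if per-token accuracy p improves smoothly with scale, exact match on a k-token answer (roughly p**k under an independence assumption) still rises abruptly, which can read as emergence. The values of p and k below are illustrative:
```python
import numpy as np

scales = np.linspace(0.70, 0.99, 8)   # smoothly improving per-token accuracy
k = 20                                # hypothetical answer length in tokens

for p in scales:
    # Exact match requires all k tokens correct; the curve is near zero for
    # most of the range, then shoots up as p approaches 1.
    print(f"per-token acc={p:.2f}  exact-match(k={k})={p**k:.4f}")
```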
arXiv Detail & Related papers (2023-04-28T17:52:11Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models fine-tuned with few examples exhibit strong prediction bias across labels.
Although few-shot fine-tuning can mitigate this prediction bias, our analysis shows that models gain their performance improvement by capturing non-task-related features.
These observations warn that pursuing model performance with fewer examples may incur pathological prediction behavior.
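A minimal sketch of checking for such prediction bias: compare the frequency of each predicted label against the label prior (the predictions below are made-up outputs of a hypothetical few-shot fine-tuned model):
```python
from collections import Counter

# Made-up predictions from a hypothetical 3-class classifier with a
# uniform label prior; a heavy skew toward one label indicates bias.
predictions = ["positive"] * 70 + ["negative"] * 20 + ["neutral"] * 10
counts = Counter(predictions)
n = len(predictions)
for label, c in counts.items():
    print(f"{label:9s} predicted {c / n:.0%} of the time (prior: 33%)")
```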
arXiv Detail & Related papers (2022-04-17T15:55:18Z)
- FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z)
- Reducing Risk of Model Inversion Using Privacy-Guided Training [0.0]
Several recent attacks have been able to infer sensitive information from trained models.
We present a solution for countering model inversion attacks in tree-based models.
arXiv Detail & Related papers (2020-06-29T09:02:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.