Characterizing Learning Curves During Language Model Pre-Training:
Learning, Forgetting, and Stability
- URL: http://arxiv.org/abs/2308.15419v1
- Date: Tue, 29 Aug 2023 16:24:09 GMT
- Title: Characterizing Learning Curves During Language Model Pre-Training:
Learning, Forgetting, and Stability
- Authors: Tyler A. Chang, Zhuowen Tu, Benjamin K. Bergen
- Abstract summary: We observe that the language models generate short repetitive phrases before learning to generate longer and more coherent text.
We quantify the final surprisal, within-run variability, age of acquisition, forgettability, and cross-run variability of learning curves for individual tokens in context.
Our work contributes to a better understanding of language model pre-training dynamics and informs the deployment of stable language models in practice.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How do language models learn to make predictions during pre-training? To
study this question, we extract learning curves from five autoregressive
English language model pre-training runs, for 1M tokens in context. We observe
that the language models generate short repetitive phrases before learning to
generate longer and more coherent text. We quantify the final surprisal,
within-run variability, age of acquisition, forgettability, and cross-run
variability of learning curves for individual tokens in context. More frequent
tokens reach lower final surprisals, exhibit less variability within and across
pre-training runs, are learned earlier, and are less likely to be "forgotten"
during pre-training. Higher n-gram probabilities further accentuate these
effects. Independent of the target token, shorter and more frequent contexts
correlate with marginally more stable and quickly acquired predictions. Effects
of part-of-speech are also small, although nouns tend to be acquired later and
less stably than verbs, adverbs, and adjectives. Our work contributes to a
better understanding of language model pre-training dynamics and informs the
deployment of stable language models in practice.
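To make these five quantities concrete, below is a minimal sketch of how they could be computed from a single learning curve, i.e. the surprisal of one token in context recorded at every pre-training checkpoint. The tail window, acquisition threshold, and forgetting definition here are illustrative simplifications, not the paper's exact operationalizations.

```python
import numpy as np

def learning_curve_metrics(surprisals, margin=1.0, tail_frac=0.1):
    """Summarize one token-in-context learning curve.

    surprisals: per-checkpoint surprisal (-log2 p) of the target token.
    The thresholds below are illustrative, not the paper's exact definitions.
    """
    surprisals = np.asarray(surprisals, dtype=float)
    n = len(surprisals)
    tail = max(1, int(tail_frac * n))

    # Final surprisal: average over the last few checkpoints.
    final = float(surprisals[-tail:].mean())

    # Age of acquisition: first checkpoint whose surprisal comes within
    # `margin` bits of the final value.
    aoa = int(np.flatnonzero(surprisals <= final + margin)[0])

    # Within-run variability: spread of the curve after acquisition.
    within_var = float(surprisals[aoa:].std())

    # Forgettability: how far surprisal climbs back up after acquisition.
    forgetting = float(surprisals[aoa:].max() - final)

    return {"final_surprisal": final,
            "age_of_acquisition": aoa,
            "within_run_variability": within_var,
            "forgettability": forgetting}

def cross_run_variability(final_surprisals_per_run):
    """Spread of the same token's final surprisal across pre-training runs."""
    return float(np.std(final_surprisals_per_run))
```

Aggregating these per-token statistics by frequency, n-gram probability, context length, or part-of-speech then yields analyses of the kind reported in the abstract.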
Related papers
- Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times
Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades.
This paper presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends.
The results indicate that Transformer-based language models' surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.
arXiv Detail & Related papers (2024-02-03T20:22:54Z)
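The analysis above rests on per-token surprisal under a causal language model. The following sketch is a minimal illustration rather than the paper's pipeline: the "gpt2" checkpoint and the example sentence are stand-ins, and the frequency comparison would still require aligning these values with external per-word counts.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in for whichever checkpoint is being evaluated.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_surprisals(text):
    """Surprisal (-log2 p) of each token given its preceding context."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    targets = ids[0, 1:]
    lp = log_probs[0, :-1].gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return list(zip(tok.convert_ids_to_tokens(targets.tolist()),
                    (-lp / math.log(2)).tolist()))

print(token_surprisals("The cat sat on the mat."))
```

Relating these surprisals to corpus frequency or reading times is then a matter of aligning them with per-word measures and computing a correlation (e.g. with numpy.corrcoef), which is the kind of analysis the paper performs at much larger scale.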
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models
We propose the MiLe Loss function to mitigate the bias caused by differing learning difficulties across tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss gain consistent performance improvements on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
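The summary does not spell out the MiLe formulation, so the snippet below only illustrates the general idea of reweighting the per-token cross-entropy by how hard a token currently is for the model. The focal-style weight (1 - p)^gamma is borrowed for illustration and is an assumption, not the paper's loss.

```python
import torch
import torch.nn.functional as F

def difficulty_weighted_ce(logits, targets, gamma=1.0, ignore_index=-100):
    """Generic difficulty-weighted token cross-entropy (illustrative only).

    logits: (batch, seq_len, vocab); targets: (batch, seq_len).
    Tokens the model already predicts confidently are down-weighted so
    training focuses on harder tokens; gamma controls the strength.
    """
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                targets.reshape(-1),
                                reduction="none", ignore_index=ignore_index)
    mask = (targets.reshape(-1) != ignore_index).float()
    p_correct = torch.exp(-per_token)            # probability of the gold token
    weights = (1.0 - p_correct) ** gamma         # focal-style difficulty weight
    return (weights.detach() * per_token * mask).sum() / mask.sum().clamp(min=1.0)
```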
- Uncertainty-Aware Unlikelihood Learning Improves Generative Aspect Sentiment Quad Prediction
We propose a template-agnostic method to control token-level generation.
Specifically, we introduce Monte Carlo dropout to understand the built-in uncertainty of pre-trained language models.
We further propose marginalized unlikelihood learning to suppress the uncertainty-aware mistake tokens.
arXiv Detail & Related papers (2023-06-01T07:49:06Z)
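Monte Carlo dropout itself is a standard technique: keep dropout stochastic at inference time and measure how much the predictive distribution moves across repeated forward passes. The sketch below assumes a model whose forward pass returns an object with `.logits` (as Hugging Face models do); the number of passes and the use of per-token variance as the uncertainty signal are illustrative choices, not necessarily the paper's procedure.

```python
import torch

def enable_dropout(model):
    """Keep only the dropout modules stochastic while the rest stays in eval mode."""
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

def mc_dropout_uncertainty(model, input_ids, n_passes=10):
    """Token-level predictive uncertainty via Monte Carlo dropout.

    Returns the mean and variance of the output distributions over
    `n_passes` stochastic forward passes; high variance marks tokens
    the model is uncertain about.
    """
    model.eval()
    enable_dropout(model)
    probs = []
    with torch.no_grad():
        for _ in range(n_passes):
            logits = model(input_ids).logits       # (batch, seq_len, vocab)
            probs.append(torch.softmax(logits, dim=-1))
    model.eval()                                    # restore deterministic inference
    stacked = torch.stack(probs)                    # (n_passes, batch, seq_len, vocab)
    return stacked.mean(dim=0), stacked.var(dim=0)
```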
- Training Trajectories of Language Models Across Scales
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Language Model Pre-Training with Sparse Latent Typing
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- What do Large Language Models Learn beyond Language?
We find that pretrained models significantly outperform comparable non-pretrained neural models.
Experiments surprisingly reveal that the positive effects of pre-training persist even when pretraining on multi-lingual text or computer code.
Our findings suggest a hitherto unexplored deep connection between pre-training and inductive learning abilities of language models.
arXiv Detail & Related papers (2022-10-21T23:43:13Z)
- Token-wise Curriculum Learning for Neural Machine Translation
Existing curriculum learning approaches to Neural Machine Translation (NMT) rely on sampling sufficient amounts of "easy" examples from the training data during the early stage of training.
We propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples.
Our approach can consistently outperform baselines on 5 language pairs, especially for low-resource languages.
arXiv Detail & Related papers (2021-03-20T03:57:59Z)
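The paper's specific construction of easy token-level samples is not described in this summary, so the helper below is only a generic sketch of a token-wise curriculum: easy tokens receive full weight from the start and the admission threshold relaxes as training progresses. The difficulty scores, linear schedule, and sigmoid gating are assumptions for illustration.

```python
import torch

def token_curriculum_weights(token_difficulty, progress, sharpness=5.0):
    """Generic token-wise curriculum weights (illustrative, not the paper's method).

    token_difficulty: (batch, seq_len) scores in [0, 1], e.g. 1 - normalized frequency.
    progress: fraction of training completed, in [0, 1].
    Tokens whose difficulty is below the current progress get weight near 1;
    harder tokens are phased in as training advances.
    """
    return torch.sigmoid(sharpness * (progress - token_difficulty))
```

In use, these weights would simply multiply the per-token translation cross-entropy before averaging.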
- Pretrained Language Model Embryology: The Birth of ALBERT
We investigate the developmental process from a set of randomly initialized parameters to a totipotent language model.
Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) in different learning speeds during pretraining.
These findings suggest that the knowledge of a pretrained model varies during pretraining, and that more pre-training steps do not necessarily yield a model with more comprehensive knowledge.
arXiv Detail & Related papers (2020-10-06T05:15:39Z)
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning
Fine-tuning pre-trained language models on downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when fine-tuning it on downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)