Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale
- URL: http://arxiv.org/abs/2305.17266v2
- Date: Tue, 30 May 2023 18:37:32 GMT
- Title: Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale
- Authors: Vijeta Deshpande, Dan Pechi, Shree Thatte, Vladislav Lialin, Anna
Rumshisky
- Abstract summary: We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters.
We examine downscaling effects, extending scaling laws to models as small as 1M parameters.
- Score: 5.759319006531332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, language models have drastically grown in size, and the
abilities of these models have been shown to improve with scale. The majority
of recent scaling-law studies have focused on high-compute, high-parameter-count
settings, leaving the question of when these abilities begin to emerge largely
unanswered. In this paper, we investigate whether the effects of pre-training
can be observed when the problem size is reduced, modeling a smaller,
reduced-vocabulary language. We show the benefits of pre-training with a masked
language modeling (MLM) objective in models as small as 1.25M parameters, and
establish a strong correlation between pre-training perplexity and downstream
performance (GLUE benchmark). We examine downscaling effects, extending scaling
laws to models as small as ~1M parameters. At this scale, we observe a break of
the power law for compute-optimal models and show that the MLM loss does not
scale smoothly with compute cost (FLOPs) below $2.2 \times 10^{15}$ FLOPs. We
also find that adding layers does not always benefit downstream performance.
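To make the reported break concrete, the sketch below fits the usual compute-loss power law $L(C) = a\,C^{-b}$ separately on either side of the $2.2 \times 10^{15}$ FLOPs threshold. All numbers are synthetic placeholders, not the paper's measurements; a smooth power law would give similar exponents on both sides, whereas a break shows up as a mismatch.

```python
# Sketch: fit a compute-loss power law L(C) = a * C**(-b) on either side of a
# candidate break point. Data here is synthetic and purely illustrative.
import numpy as np

def fit_power_law(flops, loss):
    """Least-squares fit of log L = log(a) - b * log(C); returns (a, b)."""
    slope, intercept = np.polyfit(np.log(flops), np.log(loss), deg=1)
    return np.exp(intercept), -slope

rng = np.random.default_rng(0)
flops = np.logspace(13, 18, 30)              # hypothetical compute budgets (FLOPs)
loss = 3e3 * flops ** -0.08                  # a smooth power law ...
loss[flops < 2.2e15] *= 1.15                 # ... perturbed below the break point
loss *= np.exp(rng.normal(0.0, 0.01, flops.shape))

below, above = flops < 2.2e15, flops >= 2.2e15
a_lo, b_lo = fit_power_law(flops[below], loss[below])
a_hi, b_hi = fit_power_law(flops[above], loss[above])
print(f"exponent below break: {b_lo:.3f}, above break: {b_hi:.3f}")
```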
Related papers
- Tending Towards Stability: Convergence Challenges in Small Language Models [3.734405405403176]
Despite their advantages, smaller models frequently underperform compared to their larger counterparts.
This is anecdotally attributed to their reduced representational capacity.
By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.
arXiv Detail & Related papers (2024-10-15T09:57:19Z)
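The effective-rank measure referenced in the entry above has a common entropy-based formulation: the exponential of the entropy of the normalized singular-value spectrum. The sketch below uses that generic definition; whether it matches the paper's exact estimator is an assumption.

```python
# Sketch: an entropy-based "effective rank" of a weight matrix, a common way to
# quantify how many directions a parameter matrix actually uses. The paper may
# use a different estimator; this definition is an illustrative assumption.
import numpy as np

def effective_rank(weight: np.ndarray, eps: float = 1e-12) -> float:
    """exp(entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(weight, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))   # rank <= 8
full_rank = rng.normal(size=(256, 256))
print(f"low-rank: {effective_rank(low_rank):.1f}, "
      f"dense Gaussian: {effective_rank(full_rank):.1f}")
```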
- Large Language Model Pruning [0.0]
We suggest a model pruning technique specifically focused on LLMs.
The proposed methodology emphasizes the explainability of deep learning models.
We also explore the difference between pruning on large-scale models vs. pruning on small-scale models.
arXiv Detail & Related papers (2024-05-24T18:22:15Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
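One way to read the "smooth, sigmoidal behavior" noted above is as a logistic fit of benchmark accuracy against a scalar capability measure such as log-compute. The snippet below is an illustrative sketch with synthetic numbers, not the paper's actual procedure or data.

```python
# Sketch: fit a sigmoid of benchmark accuracy vs. a scalar capability measure
# (here, log10 training FLOPs). Data, axis choice, and parameters are synthetic
# illustrations, not the paper's observational fits.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, mid, slope, floor, ceil):
    return floor + (ceil - floor) / (1.0 + np.exp(-slope * (x - mid)))

log_flops = np.linspace(20, 25, 12)                       # capability proxy
acc = sigmoid(log_flops, 23.0, 2.0, 0.25, 0.90)           # "emergent"-looking curve
acc += np.random.default_rng(0).normal(0.0, 0.01, acc.shape)

params, _ = curve_fit(sigmoid, log_flops, acc, p0=[23.0, 1.0, 0.2, 0.9])
print("predicted accuracy at 1e26 FLOPs:", round(float(sigmoid(26.0, *params)), 3))
```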
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model [4.6373877301731]
We train a suite of multimodal foundation models (MMFMs) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs).
We test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone.
The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models.
arXiv Detail & Related papers (2024-03-29T21:32:50Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For example, scaling laws mostly predict next-token prediction loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
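The dynamic batch loading described above can be pictured as re-weighting domain sampling proportions toward domains whose loss is still far from a reference value. The sketch below is a simplified rendition of that idea with hypothetical domain names and losses, not the authors' exact update rule.

```python
# Sketch: dynamic batch loading -- upweight domains whose current loss is still
# far above a per-domain reference loss, then renormalize to get sampling
# proportions. A simplified rendition of the idea; values are hypothetical.
import numpy as np

def update_domain_weights(weights, current_loss, reference_loss, lr=1.0):
    """Exponentiated update on per-domain sampling weights."""
    excess = np.maximum(np.asarray(current_loss) - np.asarray(reference_loss), 0.0)
    new_w = np.asarray(weights) * np.exp(lr * excess)
    return new_w / new_w.sum()

domains = ["web", "code", "books", "wiki"]
weights = np.full(len(domains), 1.0 / len(domains))
weights = update_domain_weights(weights, [2.9, 2.1, 3.4, 2.5], [2.6, 2.0, 2.8, 2.4])
print(dict(zip(domains, np.round(weights, 3))))
```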
- nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss based on our observations that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws.
With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.
Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
arXiv Detail & Related papers (2023-04-14T00:45:01Z)
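The loss forecasting above relies on small proxy models (with hyperparameters transferred via muP) to fit a scaling curve that is then evaluated at the target size. The sketch below shows only that extrapolation step, with a hypothetical functional form $L(N) = a\,N^{-\alpha} + c$ and synthetic measurements.

```python
# Sketch: the extrapolation step behind cross-scale loss prediction -- fit
# L(N) = a * N**(-alpha) + c on small proxy models (assumed to share
# muP-transferred hyperparameters) and evaluate at a large target size.
# The sizes and "measured" losses below are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(n_params, a, alpha, c):
    return a * n_params ** (-alpha) + c

proxy_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])          # small proxy models
proxy_losses = loss_law(proxy_sizes, 400.0, 0.30, 1.7)      # pretend measurements

params, _ = curve_fit(loss_law, proxy_sizes, proxy_losses, p0=[300.0, 0.3, 1.5])
print("forecast loss at 52B parameters:", round(float(loss_law(52e9, *params)), 3))
```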
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
- Scaling Laws for Acoustic Models [7.906034575114518]
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.