Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale
- URL: http://arxiv.org/abs/2305.17266v2
- Date: Tue, 30 May 2023 18:37:32 GMT
- Title: Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale
- Authors: Vijeta Deshpande, Dan Pechi, Shree Thatte, Vladislav Lialin, Anna
Rumshisky
- Abstract summary: We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters.
We examine downscaling effects, extending scaling laws to models as small as 1M parameters.
- Score: 5.759319006531332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, language models have drastically grown in size, and the
abilities of these models have been shown to improve with scale. The majority
of recent scaling-law studies have focused on high-compute, high-parameter-count
settings, leaving the question of when these abilities begin to emerge largely
unanswered. In this paper, we investigate whether the effects of pre-training
can be observed when the problem size is reduced, modeling a smaller,
reduced-vocabulary language. We show the benefits of pre-training with a masked
language modeling (MLM) objective in models as small as 1.25M parameters, and
establish a strong correlation between pre-training perplexity and downstream
performance (GLUE benchmark). We examine downscaling effects, extending scaling
laws to models as small as ~1M parameters. At this scale, we observe a break of
the power law for compute-optimal models and show that the MLM loss does not
scale smoothly with compute cost (FLOPs) below $2.2 \times 10^{15}$ FLOPs. We
also find that adding layers does not always benefit downstream performance.
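To make the reported break concrete, the sketch below fits the usual compute-loss power law $L(C) = a\,C^{-b}$ separately on either side of the $2.2 \times 10^{15}$ FLOPs threshold. All numbers are synthetic placeholders, not the paper's measurements; a smooth power law would give similar exponents on both sides, whereas a break shows up as a mismatch.

```python
# Sketch: fit a compute-loss power law L(C) = a * C**(-b) on either side of a
# candidate break point. Data here is synthetic and purely illustrative.
import numpy as np

def fit_power_law(flops, loss):
    """Least-squares fit of log L = log(a) - b * log(C); returns (a, b)."""
    slope, intercept = np.polyfit(np.log(flops), np.log(loss), deg=1)
    return np.exp(intercept), -slope

rng = np.random.default_rng(0)
flops = np.logspace(13, 18, 30)              # hypothetical compute budgets (FLOPs)
loss = 3e3 * flops ** -0.08                  # a smooth power law ...
loss[flops < 2.2e15] *= 1.15                 # ... perturbed below the break point
loss *= np.exp(rng.normal(0.0, 0.01, flops.shape))

below, above = flops < 2.2e15, flops >= 2.2e15
a_lo, b_lo = fit_power_law(flops[below], loss[below])
a_hi, b_hi = fit_power_law(flops[above], loss[above])
print(f"exponent below break: {b_lo:.3f}, above break: {b_hi:.3f}")
```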
Related papers
- Tending Towards Stability: Convergence Challenges in Small Language Models [3.734405405403176]
Despite their advantages, smaller models frequently underperform compared to their larger counterparts.
This is anecdotally attributed to their reduced representational capacity.
By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.
arXiv Detail & Related papers (2024-10-15T09:57:19Z)
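The effective-rank measure referenced in the entry above has a common entropy-based formulation: the exponential of the entropy of the normalized singular-value spectrum. The sketch below uses that generic definition; whether it matches the paper's exact estimator is an assumption.

```python
# Sketch: an entropy-based "effective rank" of a weight matrix, a common way to
# quantify how many directions a parameter matrix actually uses. The paper may
# use a different estimator; this definition is an illustrative assumption.
import numpy as np

def effective_rank(weight: np.ndarray, eps: float = 1e-12) -> float:
    """exp(entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(weight, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))   # rank <= 8
full_rank = rng.normal(size=(256, 256))
print(f"low-rank: {effective_rank(low_rank):.1f}, "
      f"dense Gaussian: {effective_rank(full_rank):.1f}")
```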
- Large Language Model Pruning [0.0]
We suggest a model pruning technique specifically focused on LLMs.
The proposed methodology emphasizes the explainability of deep learning models.
We also explore the difference between pruning on large-scale models vs. pruning on small-scale models.
arXiv Detail & Related papers (2024-05-24T18:22:15Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
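One way to read the "smooth, sigmoidal behavior" noted above is as a logistic fit of benchmark accuracy against a scalar capability measure such as log-compute. The snippet below is an illustrative sketch with synthetic numbers, not the paper's actual procedure or data.

```python
# Sketch: fit a sigmoid of benchmark accuracy vs. a scalar capability measure
# (here, log10 training FLOPs). Data, axis choice, and parameters are synthetic
# illustrations, not the paper's observational fits.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, mid, slope, floor, ceil):
    return floor + (ceil - floor) / (1.0 + np.exp(-slope * (x - mid)))

log_flops = np.linspace(20, 25, 12)                       # capability proxy
acc = sigmoid(log_flops, 23.0, 2.0, 0.25, 0.90)           # "emergent"-looking curve
acc += np.random.default_rng(0).normal(0.0, 0.01, acc.shape)

params, _ = curve_fit(sigmoid, log_flops, acc, p0=[23.0, 1.0, 0.2, 0.9])
print("predicted accuracy at 1e26 FLOPs:", round(float(sigmoid(26.0, *params)), 3))
```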
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model [4.6373877301731]
We train a suite of multimodal foundation models (MMFMs) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs).
We test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone.
The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models.
arXiv Detail & Related papers (2024-03-29T21:32:50Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For example, scaling laws mostly predict next-token prediction loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
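The dynamic batch loading described above can be pictured as re-weighting domain sampling proportions toward domains whose loss is still far from a reference value. The sketch below is a simplified rendition of that idea with hypothetical domain names and losses, not the authors' exact update rule.

```python
# Sketch: dynamic batch loading -- upweight domains whose current loss is still
# far above a per-domain reference loss, then renormalize to get sampling
# proportions. A simplified rendition of the idea; values are hypothetical.
import numpy as np

def update_domain_weights(weights, current_loss, reference_loss, lr=1.0):
    """Exponentiated update on per-domain sampling weights."""
    excess = np.maximum(np.asarray(current_loss) - np.asarray(reference_loss), 0.0)
    new_w = np.asarray(weights) * np.exp(lr * excess)
    return new_w / new_w.sum()

domains = ["web", "code", "books", "wiki"]
weights = np.full(len(domains), 1.0 / len(domains))
weights = update_domain_weights(weights, [2.9, 2.1, 3.4, 2.5], [2.6, 2.0, 2.8, 2.4])
print(dict(zip(domains, np.round(weights, 3))))
```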
- nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss based on our observations that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws.
With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.
Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
arXiv Detail & Related papers (2023-04-14T00:45:01Z)
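The loss forecasting above relies on small proxy models (with hyperparameters transferred via muP) to fit a scaling curve that is then evaluated at the target size. The sketch below shows only that extrapolation step, with a hypothetical functional form $L(N) = a\,N^{-\alpha} + c$ and synthetic measurements.

```python
# Sketch: the extrapolation step behind cross-scale loss prediction -- fit
# L(N) = a * N**(-alpha) + c on small proxy models (assumed to share
# muP-transferred hyperparameters) and evaluate at a large target size.
# The sizes and "measured" losses below are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(n_params, a, alpha, c):
    return a * n_params ** (-alpha) + c

proxy_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])          # small proxy models
proxy_losses = loss_law(proxy_sizes, 400.0, 0.30, 1.7)      # pretend measurements

params, _ = curve_fit(loss_law, proxy_sizes, proxy_losses, p0=[300.0, 0.3, 1.5])
print("forecast loss at 52B parameters:", round(float(loss_law(52e9, *params)), 3))
```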
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
- Scaling Laws for Acoustic Models [7.906034575114518]
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.