Benchmarking down-scaled (not so large) pre-trained language models
- URL: http://arxiv.org/abs/2105.04876v1
- Date: Tue, 11 May 2021 09:01:04 GMT
- Title: Benchmarking down-scaled (not so large) pre-trained language models
- Authors: M. Aßenmacher, P. Schulze, C. Heumann
- Abstract summary: Large Transformer-based language models are pre-trained on corpora of varying sizes, for a different number of steps and with different batch sizes.
We compare three pre-training objectives for different shape parameters and model sizes, while also varying the number of pre-training steps and the batch size.
In our experiments MLM + NSP (BERT-style) consistently outperforms MLM (RoBERTa-style) as well as the standard LM objective.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Transformer-based language models are pre-trained on corpora of varying
sizes, for a different number of steps and with different batch sizes. At the
same time, more fundamental components, such as the pre-training objective or
architectural hyperparameters, are modified. In total, it is therefore
difficult to ascribe changes in performance to specific factors. Since
searching the hyperparameter space over the full systems is too costly, we
pre-train down-scaled versions of several popular Transformer-based
architectures on a common pre-training corpus and benchmark them on a subset of
the GLUE tasks (Wang et al., 2018). Specifically, we systematically compare
three pre-training objectives for different shape parameters and model sizes,
while also varying the number of pre-training steps and the batch size. In our
experiments MLM + NSP (BERT-style) consistently outperforms MLM (RoBERTa-style)
as well as the standard LM objective. Furthermore, we find that additional
compute should be mainly allocated to an increased model size, while training
for more steps is inefficient. Based on these observations, as a final step we
attempt to scale up several systems using compound scaling (Tan and Le, 2019)
adapted to Transformer-based language models.
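As a rough illustration of the setup described above (a sketch, not the authors' code), the snippet below instantiates down-scaled BERT-style models for different shape parameters and two of the pre-training objectives using the Hugging Face transformers library, plus a naive compound-scaling helper in the spirit of Tan and Le (2019). All shape values and scaling coefficients are illustrative placeholders, not the paper's grid.

```python
# Minimal sketch, not the paper's code: down-scaled BERT-style models for
# different shape parameters and pre-training objectives (placeholder values).
from transformers import BertConfig, BertForPreTraining, BertForMaskedLM

shapes = {
    "tiny":  dict(num_hidden_layers=2, hidden_size=128, num_attention_heads=2, intermediate_size=512),
    "small": dict(num_hidden_layers=4, hidden_size=256, num_attention_heads=4, intermediate_size=1024),
    "mid":   dict(num_hidden_layers=6, hidden_size=512, num_attention_heads=8, intermediate_size=2048),
}

objectives = {
    "mlm_nsp": BertForPreTraining,  # BERT-style: masked LM + next-sentence prediction
    "mlm":     BertForMaskedLM,     # RoBERTa-style: masked LM only
}

models = {
    (name, obj): cls(BertConfig(**kwargs))
    for name, kwargs in shapes.items()
    for obj, cls in objectives.items()
}

def compound_scale(depth, width, phi, alpha=1.2, beta=1.1):
    """EfficientNet-style compound scaling (Tan and Le, 2019), naively adapted:
    grow depth and width jointly with one exponent phi. Coefficients are placeholders."""
    return round(depth * alpha ** phi), int(width * beta ** phi)
```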
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Scaling Law for Language Models Training Considering Batch Size [17.09348741898811]
Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress.
We empirically investigate how a critical hyperparameter, the global batch size, influences the LLM training process.
We establish a basic scaling law on model size and training data amount.
We then examine how varying batch sizes and learning rates affect the convergence and generalization of these models.
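A basic scaling law of the kind mentioned here can be illustrated with a quick curve fit. The functional form below is a common choice in the scaling-law literature (loss as a function of parameter count N and token count D), and the data points are synthetic, not taken from the paper.

```python
# Illustrative only: fit L(N, D) = E + A*N^-alpha + B*D^-beta to synthetic data.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, alpha, B, beta):
    N, D = x
    return E + A * N ** (-alpha) + B * D ** (-beta)

rng = np.random.default_rng(0)
N = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9, 1e10, 3e10])   # parameters (made up)
D = np.array([1e9, 3e9, 1e9, 1e10, 3e10, 1e10, 1e11, 1e11])  # training tokens (made up)
true = (1.7, 400.0, 0.34, 400.0, 0.28)                       # placeholder "ground truth"
L = scaling_law((N, D), *true) + rng.normal(0, 0.01, size=N.size)

popt, _ = curve_fit(scaling_law, (N, D), L, p0=(2.0, 100.0, 0.3, 100.0, 0.3), maxfev=50000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], popt)))
```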
arXiv Detail & Related papers (2024-12-02T13:58:35Z) - Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis [16.253898272659242]
This study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to feedforward networks (FFNs)
Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., 2.6× FFN speed-up with 32% of the parameters) and effective during training.
Motivated by this finding, we develop wide and structured networks that surpass medium-sized and large-sized Transformers in perplexity and throughput.
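The general idea, sketched very roughly below (illustrative PyTorch, not the paper's code), is to replace each dense FFN weight matrix with a product of two thin matrices of rank r, which shrinks both parameter count and FLOPs when r is much smaller than the layer widths.

```python
# Rough sketch of a low-rank-parametrized FFN block (placeholder sizes).
import torch.nn as nn

class LowRankFFN(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, rank=128):
        super().__init__()
        # Each d_model x d_ff matrix is replaced by (d_model x rank)(rank x d_ff).
        self.up = nn.Sequential(nn.Linear(d_model, rank, bias=False), nn.Linear(rank, d_ff))
        self.act = nn.GELU()
        self.down = nn.Sequential(nn.Linear(d_ff, rank, bias=False), nn.Linear(rank, d_model))

    def forward(self, x):
        return self.down(self.act(self.up(x)))

full = 2 * 768 * 3072                        # dense FFN weights
low  = 2 * 128 * (768 + 3072)                # factorized weights
print(f"parameter ratio: {low / full:.0%}")  # ~21% with these placeholder sizes
```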
arXiv Detail & Related papers (2024-07-13T10:08:55Z) - Scaling-laws for Large Time-series Models [2.0671213754662343]
Time series forecasting shares a similar sequential structure to language, and is amenable to large-scale transformer architectures.
We show that foundational decoder-only time series transformer models exhibit scaling behavior analogous to LLMs.
We establish for the first time power-law scaling with parameter count, dataset size, and training compute, spanning five orders of magnitude.
arXiv Detail & Related papers (2024-05-22T17:48:17Z) - Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO)
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
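A heavily simplified sketch of the "share a central factor across layers" idea follows; it uses a plain two-sided matrix factorization rather than a genuine MPO decomposition, and all sizes are placeholders.

```python
# Simplified illustration (not a real MPO): each layer computes
# right_l(core(left_l(x))), with the small core shared across all layers.
import torch.nn as nn

class SharedCoreLinear(nn.Module):
    def __init__(self, d_in, d_out, core):
        super().__init__()
        self.left = nn.Linear(d_in, core.in_features, bias=False)   # per-layer factor
        self.core = core                                            # shared "central tensor"
        self.right = nn.Linear(core.out_features, d_out)            # per-layer factor

    def forward(self, x):
        return self.right(self.core(self.left(x)))

d_model, d_core, n_layers = 512, 64, 24
shared_core = nn.Linear(d_core, d_core, bias=False)                 # one copy for all layers
layers = nn.ModuleList(SharedCoreLinear(d_model, d_model, shared_core) for _ in range(n_layers))
```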
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
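In rough outline (a sketch under assumptions, not the released eP-ALM code), the recipe amounts to freezing both pretrained backbones and training only a linear projection of the visual features plus a single soft token prepended to the language model's input embeddings.

```python
# Illustrative sketch: freeze the LM and vision encoder, train only a linear
# projection and one soft prompt token (placeholder module, made-up names).
import torch
import torch.nn as nn

class PerceptualAdapter(nn.Module):
    def __init__(self, vision_dim, lm_dim):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)                  # trainable
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # trainable

    def forward(self, vision_feats, token_embeds):
        # vision_feats: (B, vision_dim); token_embeds: (B, T, lm_dim)
        vis = self.proj(vision_feats).unsqueeze(1)
        prompt = self.soft_token.expand(token_embeds.size(0), -1, -1)
        return torch.cat([prompt, vis, token_embeds], dim=1)       # fed to the frozen LM

def freeze(module: nn.Module):
    for p in module.parameters():
        p.requires_grad = False  # apply to the pretrained LM and the vision encoder
```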
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Evaluating natural language processing models with generalization
metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
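One data-free metric in this spirit (a sketch of the general heavy-tail idea, not the paper's exact estimator) is a power-law exponent fitted to the tail of each layer's weight spectrum.

```python
# Illustrative: estimate a Hill-style power-law tail exponent from the
# eigenvalues of W^T W for a weight matrix W; no training or test data needed.
import numpy as np

def tail_alpha(weight, tail_frac=0.25):
    svals = np.linalg.svd(weight, compute_uv=False)
    eigs = np.sort(svals ** 2)[::-1]                  # spectrum of W^T W, descending
    k = max(int(len(eigs) * tail_frac), 2)
    tail = eigs[:k]
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))  # Hill estimator on the top-k tail

rng = np.random.default_rng(0)
print(tail_alpha(rng.normal(size=(768, 3072))))       # toy random "weight matrix"
```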
arXiv Detail & Related papers (2022-02-06T20:07:35Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
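A rough sketch of the decoupling idea (not the GroupBERT implementation): pair a depthwise convolution, which handles local interactions, with self-attention, which handles global ones, inside a single block.

```python
# Illustrative block: depthwise convolution for local interactions alongside
# self-attention for global interactions (placeholder sizes).
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel_size=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)  # depthwise
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                 # x: (B, T, d_model)
        a, _ = self.attn(x, x, x)                         # global interactions
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local interactions
        return self.norm2(x + c)

print(ConvAttnBlock()(torch.randn(2, 16, 256)).shape)     # torch.Size([2, 16, 256])
```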
arXiv Detail & Related papers (2021-06-10T15:41:53Z) - Switch Transformers: Scaling to Trillion Parameter Models with Simple
and Efficient Sparsity [35.84448624327473]
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs.
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
arXiv Detail & Related papers (2021-01-11T16:11:52Z)
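The core of such a routing scheme can be sketched in a few lines: a toy top-1 ("switch") router, not the Switch Transformer implementation, which additionally uses capacity limits, a load-balancing loss, and expert parallelism.

```python
# Toy top-1 routing: each token is sent to exactly one expert FFN, scaled by
# its router probability. Placeholder sizes; no capacity factor or aux loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        gate, idx = probs.max(dim=-1)                  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

print(SwitchFFN()(torch.randn(32, 256)).shape)         # torch.Size([32, 256])
```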
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.