Benchmarking down-scaled (not so large) pre-trained language models
- URL: http://arxiv.org/abs/2105.04876v1
- Date: Tue, 11 May 2021 09:01:04 GMT
- Title: Benchmarking down-scaled (not so large) pre-trained language models
- Authors: M. Aßenmacher, P. Schulze, C. Heumann
- Abstract summary: Large Transformer-based language models are pre-trained on corpora of varying sizes, for different numbers of steps, and with different batch sizes.
We compare three pre-training objectives for different shape parameters and model sizes, while also varying the number of pre-training steps and the batch size.
In our experiments MLM + NSP (BERT-style) consistently outperforms MLM (RoBERTa-style) as well as the standard LM objective.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Transformer-based language models are pre-trained on corpora of varying
sizes, for different numbers of steps, and with different batch sizes. At the
same time, more fundamental components, such as the pre-training objective or
architectural hyperparameters, are modified. In total, it is therefore
difficult to ascribe changes in performance to specific factors. Since
searching the hyperparameter space over the full systems is too costly, we
pre-train down-scaled versions of several popular Transformer-based
architectures on a common pre-training corpus and benchmark them on a subset of
the GLUE tasks (Wang et al., 2018). Specifically, we systematically compare
three pre-training objectives for different shape parameters and model sizes,
while also varying the number of pre-training steps and the batch size. In our
experiments MLM + NSP (BERT-style) consistently outperforms MLM (RoBERTa-style)
as well as the standard LM objective. Furthermore, we find that additional
compute should be mainly allocated to an increased model size, while training
for more steps is inefficient. Based on these observations, as a final step we
attempt to scale up several systems using compound scaling (Tan and Le, 2019)
adapted to Transformer-based language models.
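To make the compound-scaling step concrete, below is a minimal sketch of one plausible adaptation of compound scaling (Tan and Le, 2019) to Transformer shape parameters, growing depth and hidden size jointly with a single compound coefficient. The base configuration, the coefficients alpha and beta, and the FLOPs-motivated constraint are illustrative assumptions, not the exact procedure used in the paper.

```python
# Illustrative sketch: compound scaling of a down-scaled Transformer.
# Depth and hidden size grow jointly with a compound coefficient phi; since
# FLOPs scale roughly with depth * hidden_size**2, choosing alpha * beta**2 ≈ 2
# makes each increment of phi roughly double the pre-training compute.

def compound_scale(base_layers, base_hidden, phi, alpha=1.26, beta=1.26):
    """Return (num_layers, hidden_size) for compound coefficient phi."""
    num_layers = round(base_layers * alpha ** phi)
    # Keep the hidden size a multiple of 64 so it divides evenly into heads.
    hidden_size = 64 * round(base_hidden * beta ** phi / 64)
    return num_layers, hidden_size

if __name__ == "__main__":
    base_layers, base_hidden = 4, 256   # assumed down-scaled BERT-style base
    for phi in range(4):
        print(phi, compound_scale(base_layers, base_hidden, phi))
```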
Related papers
- Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis [16.253898272659242]
This study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to feedforward networks (FFNs).
Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., a 2.6× FFN speed-up with 32% of the parameters) and effective during training.
Motivated by this finding, we develop wide and structured networks that surpass current medium- and large-sized Transformers in perplexity and throughput.
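A minimal sketch of the low-rank FFN idea (assuming PyTorch; the sizes and the rank below are illustrative, not the paper's configuration): each dense projection of the feed-forward block is replaced by a product of two thin matrices.

```python
# Illustrative sketch: a Transformer feed-forward block with low-rank
# parametrized projections (each dense W is factored as W ≈ U @ V).
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Approximates a d_in x d_out projection as (d_in x r) @ (r x d_out)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=True)

    def forward(self, x):
        return self.up(self.down(x))

class LowRankFFN(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, rank=256):
        super().__init__()
        self.fc1 = LowRankLinear(d_model, d_ff, rank)
        self.fc2 = LowRankLinear(d_ff, d_model, rank)
        self.act = nn.GELU()

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

x = torch.randn(2, 16, 768)        # (batch, sequence, d_model)
print(LowRankFFN()(x).shape)       # torch.Size([2, 16, 768])
```

With these illustrative sizes the two factored projections hold roughly 2M weights versus about 4.7M for the dense FFN (ignoring biases); the actual speed-up and parameter fraction depend on the chosen rank.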
arXiv Detail & Related papers (2024-07-13T10:08:55Z)
- Scaling-laws for Large Time-series Models [2.0671213754662343]
Time series forecasting shares a similar sequential structure to language, and is amenable to large-scale transformer architectures.
We show that foundational decoder-only time series transformer models exhibit scaling behavior analogous to LLMs.
We assemble a large corpus of heterogeneous time series data on which to train, and establish, for the first time, power-law scaling relations with respect to parameter count, dataset size, and training compute.
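As an illustration of how such power-law relations are typically extracted, the sketch below fits loss(N) ≈ a · N^(-b) in log-log space; the data points are made up and the fitting procedure is a generic assumption, not the paper's method.

```python
# Illustrative sketch: fitting a power law loss(N) ≈ a * N**(-b) with numpy.
import numpy as np

# Made-up (parameter count, validation loss) pairs, for illustration only.
params = np.array([1e6, 1e7, 1e8, 1e9])
loss = np.array([4.2, 3.4, 2.8, 2.3])

# A power law is linear in log-log space: log(loss) = log(a) - b * log(N).
slope, intercept = np.polyfit(np.log(params), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"loss(N) ≈ {a:.2f} * N^(-{b:.3f})")
```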
arXiv Detail & Related papers (2024-05-22T17:48:17Z)
- Timer: Generative Pre-trained Transformers Are Large Time Series Models [83.03091523806668]
This paper aims at the early development of large time series models (LTSM).
During pre-training, we curate large-scale datasets with up to 1 billion time points.
To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task.
arXiv Detail & Related papers (2024-02-04T06:55:55Z)
- Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require large amounts of memory and are often quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
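A minimal sketch of comparing models under equal compute measured in accelerator hours: a fixed accelerator-hour budget plus a measured throughput yields a training-token budget. The helper and the throughput numbers below are assumptions for illustration, not part of the Languini codebase.

```python
# Illustrative helper: equal-compute comparison in accelerator hours.
def token_budget(accelerator_hours, tokens_per_second):
    """Tokens a model can train on within a fixed accelerator-hour budget."""
    return int(accelerator_hours * 3600 * tokens_per_second)

# Made-up throughputs for two hypothetical models on the same accelerator.
for name, tps in [("model A", 50_000), ("model B", 20_000)]:
    print(name, token_budget(accelerator_hours=6, tokens_per_second=tps))
```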
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO) decomposition.
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers to reduce the model size.
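The cross-layer sharing can be pictured with a loose simplification (assuming PyTorch): one large central projection is stored once and reused by every layer, while each layer keeps only a small per-layer adjustment. This is an illustrative stand-in, not the paper's actual MPO factorization.

```python
# Loose illustrative simplification, not the actual MPO scheme: a large
# "central" projection is shared by all layers, flanked by small per-layer
# low-rank adjustments, so most parameters are stored only once.
import torch
import torch.nn as nn

class SharedCenterBlock(nn.Module):
    def __init__(self, shared_center, d_model, r):
        super().__init__()
        self.center = shared_center                    # shared, large
        self.pre = nn.Linear(d_model, r, bias=False)   # per-layer, small
        self.post = nn.Linear(r, d_model, bias=False)  # per-layer, small

    def forward(self, x):
        h = self.center(x)                             # shared transformation
        return x + self.post(torch.relu(self.pre(h)))  # small per-layer part

d_model, r, n_layers = 512, 16, 12
center = nn.Linear(d_model, d_model, bias=False)       # stored once (~262k weights)
layers = nn.ModuleList(
    [SharedCenterBlock(center, d_model, r) for _ in range(n_layers)]
)

x = torch.randn(2, 16, d_model)
for layer in layers:
    x = layer(x)
print(x.shape)                                         # torch.Size([2, 16, 512])

# .parameters() de-duplicates the shared center, so it is counted only once.
print(sum(p.numel() for p in layers.parameters()))
```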
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
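A minimal sketch of that recipe (assuming PyTorch; the stand-in LM, the dimensions, and the single-image feature are illustrative assumptions, not the eP-ALM implementation): freeze the language model, train only a linear projection of the perceptual features and one prepended soft token.

```python
# Illustrative sketch: freeze an LM; train only a linear projection of
# visual features plus a single prepended soft token.
import torch
import torch.nn as nn

class FrozenLMWithPerception(nn.Module):
    def __init__(self, lm_dim=768, vision_dim=1024):
        super().__init__()
        # Stand-in for a pretrained language model's transformer stack.
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=12,
                                       batch_first=True),
            num_layers=2,
        )
        for p in self.lm.parameters():          # freeze (almost) everything
            p.requires_grad = False
        # The only trainable pieces: one projection and one soft token.
        self.proj = nn.Linear(vision_dim, lm_dim)
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))

    def forward(self, vision_feats, token_embeds):
        # vision_feats: (B, 1, vision_dim); token_embeds: (B, T, lm_dim)
        prefix = torch.cat(
            [self.soft_token.expand(token_embeds.size(0), -1, -1),
             self.proj(vision_feats)], dim=1)
        return self.lm(torch.cat([prefix, token_embeds], dim=1))

model = FrozenLMWithPerception()
out = model(torch.randn(2, 1, 1024), torch.randn(2, 10, 768))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
# With a real multi-billion-parameter LM this fraction falls far below 1%.
print(out.shape, f"trainable fraction: {trainable / total:.2%}")
```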
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data [66.11139091362078]
We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
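As a minimal sketch of what a data-free, heavy-tail style metric can look like, the snippet below estimates a power-law tail exponent from the eigenvalue spectrum of a single weight matrix; the estimator and the random stand-in matrix are illustrative assumptions, not the exact metrics studied in the paper.

```python
# Illustrative sketch: a heavy-tail style metric computed from weights alone,
# with no access to training or test data.
import numpy as np

def tail_exponent(weight, k_frac=0.2):
    """Crude Hill-style estimate of the power-law tail exponent of the
    eigenvalue spectrum of W @ W.T."""
    eigs = np.sort(np.linalg.svd(weight, compute_uv=False) ** 2)[::-1]
    k = max(2, int(k_frac * len(eigs)))        # use the largest k eigenvalues
    tail = eigs[:k]
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072))           # stand-in for a pretrained layer
print(f"estimated tail exponent ≈ {tail_exponent(W):.2f}")
```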
arXiv Detail & Related papers (2022-02-06T20:07:35Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
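A minimal sketch of pairing self-attention (global interactions) with a convolutional module (local interactions) in one block; the kernel size and layout are illustrative and do not reproduce the GroupBERT architecture.

```python
# Illustrative sketch: a block that complements self-attention with a
# depthwise convolution, decoupling global and local interactions.
import torch
import torch.nn as nn

class AttentionPlusConvBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel_size=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # x: (B, T, d_model)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)                      # global interactions
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + c)                   # local interactions

x = torch.randn(2, 32, 256)
print(AttentionPlusConvBlock()(x).shape)           # torch.Size([2, 32, 256])
```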
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [35.84448624327473]
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs.
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
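The routing simplification can be illustrated in a few lines: each token is dispatched to the single highest-scoring expert (top-1) rather than a top-k mixture. The sketch below is a schematic of that idea (assuming PyTorch) and omits capacity factors, the load-balancing loss, and the bfloat16 details.

```python
# Illustrative sketch: Switch-style top-1 expert routing.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)       # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the gate value so routing stays differentiable.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(64, 256)                      # a flattened batch of tokens
print(Top1MoE()(tokens).shape)                     # torch.Size([64, 256])
```

Because each token activates exactly one expert's FFN, the parameter count grows with the number of experts while the per-token compute stays roughly constant.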
arXiv Detail & Related papers (2021-01-11T16:11:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.