Cerebras-GPT: Open Compute-Optimal Language Models Trained on the
Cerebras Wafer-Scale Cluster
- URL: http://arxiv.org/abs/2304.03208v1
- Date: Thu, 6 Apr 2023 16:43:16 GMT
- Title: Cerebras-GPT: Open Compute-Optimal Language Models Trained on the
Cerebras Wafer-Scale Cluster
- Authors: Nolan Dey, Gurpreet Gosal, Zhiming (Charles) Chen, Hemant Khachane,
William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness
- Abstract summary: We introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters.
We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly-available models.
We release our pre-trained models and code, making this paper the first open and reproducible work comparing compute-optimal model scaling to models trained on fixed dataset sizes.
- Score: 0.14291940946857257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study recent research advances that improve large language models through
efficient pre-training and scaling, and open datasets and tools. We combine
these advances to introduce Cerebras-GPT, a family of open compute-optimal
language models scaled from 111M to 13B parameters. We train Cerebras-GPT
models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules
for efficient pre-training (highest accuracy for a given compute budget). We
characterize the predictable power-law scaling and compare Cerebras-GPT with
other publicly-available models to show all Cerebras-GPT models have
state-of-the-art training efficiency on both pre-training and downstream
objectives. We describe our learnings including how Maximal Update
Parameterization ($\mu$P) can further improve large model scaling, improving
accuracy and hyperparameter predictability at scale. We release our pre-trained
models and code, making this paper the first open and reproducible work
comparing compute-optimal model scaling to models trained on fixed dataset
sizes. Cerebras-GPT models are available on HuggingFace:
https://huggingface.co/cerebras.
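The following is a minimal sketch, not taken from the paper, of the two ideas the abstract leans on: Chinchilla-style compute-optimal budgeting (using the commonly cited approximations C ≈ 6·N·D training FLOPs and roughly 20 training tokens per parameter) and the $\mu$P rule of thumb that a hidden-layer learning rate tuned at a small proxy width is rescaled by base_width/width when transferred to a wider model. All constants and function names are illustrative assumptions.

```python
# Minimal sketch (assumptions, not values from the paper):
#   * training compute C ~= 6 * N * D FLOPs for N parameters and D tokens
#   * Chinchilla-style budgeting uses roughly 20 tokens per parameter
#   * muP rule of thumb (Adam): hidden-layer LR shrinks as 1/width relative
#     to the small proxy width where the base LR was tuned

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into model size N and token count D.

    With C = 6*N*D and D = tokens_per_param * N:
      N = sqrt(C / (6 * tokens_per_param)),  D = tokens_per_param * N.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Transfer a hidden-layer learning rate tuned at base_width to a wider model."""
    return base_lr * base_width / width


if __name__ == "__main__":
    for flops in (2e19, 2e20, 2e21, 2e22):
        n, d = chinchilla_allocation(flops)
        print(f"C = {flops:.0e} FLOPs -> ~{n / 1e9:.2f}B params, ~{d / 1e9:.0f}B tokens")
    print("hidden-layer LR at width 2048:", mup_hidden_lr(6e-3, base_width=256, width=2048))
```

Under these assumptions, a budget of roughly 2e22 FLOPs lands near 13B parameters and about 260B tokens, which is in the same ballpark as the largest Cerebras-GPT configuration.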
Related papers
- More Compute Is What You Need [3.184416958830696]
We propose a new scaling law that suggests model performance depends mostly on the amount of compute spent for transformer-based models.
We predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
arXiv Detail & Related papers (2024-04-30T12:05:48Z)
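As a toy illustration of the compute-driven scaling view summarized above, the snippet below fits a power law loss(C) ≈ a·C^(-b) to synthetic (compute, loss) points by linear regression in log-log space and extrapolates it; the data and exponents are made up, not the scaling law proposed in that paper.

```python
# Toy illustration only: fit loss(C) ~= a * C**(-b) to synthetic (compute, loss)
# points in log-log space, then extrapolate. The numbers are made up and are not
# the scaling law proposed in the paper.
import numpy as np

compute = np.logspace(18, 22, 6)          # training FLOPs (synthetic grid)
loss = 25.0 * compute ** (-0.05)          # hypothetical observed eval losses

slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"fitted: loss(C) ~= {a:.1f} * C^(-{b:.3f})")
print("extrapolated loss at 1e23 FLOPs:", a * 1e23 ** (-b))
```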
- OpenELM: An Efficient Language Model Family with Open Training and Inference Framework [26.741510071520658]
We release OpenELM, a state-of-the-art open language model.
With a parameter budget of approximately one billion, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo.
arXiv Detail & Related papers (2024-04-22T23:12:03Z)
- Foundational GPT Model for MEG [3.524869467682149]
We propose two classes of deep learning foundational models that can be trained using forecasting of unlabelled brain signals.
First, we consider a modified Wavenet; and second, we consider a modified Transformer-based (GPT2) model.
We compare the performance of these deep learning models with standard linear autoregressive (AR) modelling on MEG data.
arXiv Detail & Related papers (2024-04-14T13:48:24Z)
- Navigating Scaling Laws: Compute Optimality in Adaptive Model Training [39.96209967632896]
In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data.
We extend the concept of optimality by allowing for an 'adaptive' model, i.e., a model that can change its shape during training.
arXiv Detail & Related papers (2023-11-06T16:20:28Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
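A rough sketch of the logit arithmetic commonly used to describe emulated fine-tuning and LM up-scaling, as summarized in the entry above: the large pre-trained model's next-token scores are shifted by the difference between a small fine-tuned model and its small pre-trained counterpart. The arrays and the alpha weight here are illustrative stand-ins, not the paper's implementation.

```python
# Rough sketch of EFT-style logit arithmetic ("LM up-scaling"): shift the large
# pre-trained model's next-token scores by the behavioural delta of a small
# (fine-tuned minus pre-trained) pair. Arrays here are dummy per-token logits;
# in practice each comes from a language model scoring the same context. The
# alpha weight is an illustrative knob, not part of the paper's definition.
import numpy as np

def eft_logits(large_base, small_ft, small_base, alpha=1.0):
    return large_base + alpha * (small_ft - small_base)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
vocab = 8
large_base = rng.normal(size=vocab)                        # large pre-trained model
small_base = rng.normal(size=vocab)                        # small pre-trained model
small_ft = small_base + rng.normal(scale=0.5, size=vocab)  # small fine-tuned model

probs = softmax(eft_logits(large_base, small_ft, small_base))
print("emulated next-token distribution:", np.round(probs, 3))
```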
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
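For intuition about reusing a smaller pre-trained model, here is a generic Net2Net-style, function-preserving width expansion of a single hidden layer; bert2BERT's actual initialization operates on full transformer blocks and is more involved, so treat this purely as an illustration of the idea.

```python
# Generic Net2Net-style function-preserving width expansion, shown on one hidden
# layer as an illustration of reusing a smaller trained model. bert2BERT's actual
# initialization handles full transformer blocks and is more involved.
import numpy as np

def widen(W1, b1, W2, new_width, rng):
    """Grow the hidden layer from W1.shape[0] to new_width units while keeping
    x -> W2 @ relu(W1 @ x + b1) unchanged."""
    old_width = W1.shape[0]
    # Each new unit copies a source unit; the first old_width units map to themselves.
    mapping = np.concatenate([np.arange(old_width),
                              rng.integers(0, old_width, new_width - old_width)])
    counts = np.bincount(mapping, minlength=old_width)
    W1_new, b1_new = W1[mapping], b1[mapping]     # duplicate incoming weights
    W2_new = W2[:, mapping] / counts[mapping]     # split outgoing weights evenly
    return W1_new, b1_new, W2_new

rng = np.random.default_rng(0)
d, h, k = 4, 3, 2
W1, b1, W2 = rng.normal(size=(h, d)), rng.normal(size=h), rng.normal(size=(k, h))
x = rng.normal(size=d)

W1n, b1n, W2n = widen(W1, b1, W2, new_width=6, rng=rng)
small_out = W2 @ np.maximum(W1 @ x + b1, 0)
large_out = W2n @ np.maximum(W1n @ x + b1n, 0)
print(np.allclose(small_out, large_out))   # True: the widened layer computes the same function
```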