SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language
Models
- URL: http://arxiv.org/abs/2303.10464v2
- Date: Sat, 29 Jul 2023 19:56:50 GMT
- Title: SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language
Models
- Authors: Vithursan Thangarasa, Abhay Gupta, William Marshall, Tianda Li, Kevin
Leong, Dennis DeCoste, Sean Lie, Shreyas Saxena
- Abstract summary: We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
- Score: 4.114555639014612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The pre-training and fine-tuning paradigm has contributed to a number of
breakthroughs in Natural Language Processing (NLP). Instead of directly
training on a downstream task, language models are first pre-trained on large
datasets with cross-domain knowledge (e.g., Pile, MassiveText, etc.) and then
fine-tuned on task-specific data (e.g., natural language generation, text
summarization, etc.). Scaling the model and dataset size has helped improve the
performance of LLMs, but unfortunately, this also leads to highly prohibitive
computational costs. Pre-training LLMs often requires orders of magnitude more
FLOPs than fine-tuning, and the model capacity often remains the same between
the two phases. To achieve training efficiency w.r.t. training FLOPs, we propose
to decouple the model capacity between the two phases and introduce Sparse
Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits
of using unstructured weight sparsity to train only a subset of weights during
pre-training (Sparse Pre-training) and then recover the representational
capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We
demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3
XL model resulting in a 2.5x reduction in pre-training FLOPs, without a
significant loss in accuracy on the downstream tasks relative to the dense
baseline. By rigorously evaluating multiple downstream tasks, we also establish
a relationship between sparsity, task complexity and dataset size. Our work
presents a promising direction to train large GPT models at a fraction of the
training FLOPs using weight sparsity, while retaining the benefits of
pre-trained textual representations for downstream tasks.
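The sketch below (illustrative only, not the authors' implementation) shows the core SPDF recipe from the abstract: a static unstructured mask holds a fixed subset of weights at zero during pre-training, and the mask is simply dropped for dense fine-tuning so the zeroed weights can learn again. The model, losses, and hyperparameters are placeholder assumptions, and a random mask is used here for simplicity.

    # Illustrative sketch of sparse pre-training followed by dense fine-tuning
    # (placeholder model and loss; not the authors' code).
    import torch
    import torch.nn as nn

    def make_sparsity_masks(model, sparsity=0.75):
        """Create a static unstructured mask per Linear layer and zero the pruned weights."""
        masks = {}
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                mask = (torch.rand_like(module.weight) > sparsity).float()
                with torch.no_grad():
                    module.weight.mul_(mask)           # prune: set masked weights to zero
                masks[name] = mask
        return masks

    def mask_gradients(model, masks):
        """Zero gradients of pruned weights so they stay at zero during sparse pre-training."""
        for name, module in model.named_modules():
            if name in masks and module.weight.grad is not None:
                module.weight.grad.mul_(masks[name])

    # Sparse pre-training: only the unmasked ~25% of each Linear layer's weights are updated.
    model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
    masks = make_sparsity_masks(model, sparsity=0.75)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for _ in range(10):                                # stand-in for the pre-training loop
        x = torch.randn(8, 512)
        loss = model(x).pow(2).mean()                  # placeholder for the LM loss
        optimizer.zero_grad()
        loss.backward()
        mask_gradients(model, masks)
        optimizer.step()

    # Dense fine-tuning: drop the masks entirely so the zeroed weights can learn.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(10):                                # stand-in for task fine-tuning
        x = torch.randn(8, 512)
        loss = model(x).pow(2).mean()                  # placeholder for the task loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()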
Related papers
- An Emulator for Fine-Tuning Large Language Models using Small Language
Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models (a sketch of this up-scaling arithmetic appears after the related-papers list).
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of
Language Models [40.54353850357839]
We show how we can employ submodular optimization to select highly representative subsets of the training corpora.
We show that the resulting models achieve up to ~99% of the performance of the fully-trained models.
arXiv Detail & Related papers (2023-05-11T09:24:41Z) - Parameter-Efficient Sparsity for Large Language Models Fine-Tuning [63.321205487234074]
We propose a Parameter-efficient Sparse Training (PST) method to reduce the number of trainable parameters during sparse-aware training.
Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) demonstrate PST performs on par or better than previous sparsity methods.
arXiv Detail & Related papers (2022-05-23T02:43:45Z) - Knowledge Distillation as Efficient Pre-training: Faster Convergence,
Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z) - NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient
Framework [10.656788279434798]
We propose a simple and efficient learning framework, TLM, that does not rely on large-scale pretraining.
On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models.
arXiv Detail & Related papers (2021-11-07T17:13:59Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language
Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter-efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
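The sketch below illustrates the log-probability arithmetic behind the EFT/LM up-scaling idea in the first related paper above. It is a hedged reconstruction rather than the paper's code, and it assumes all three models (names are placeholders) share the same tokenizer and vocabulary.

    # Illustrative EFT-style up-scaling: add the fine-tuning shift measured on a
    # small model to a large base model's next-token log-probabilities.
    # All three logit tensors are placeholders over a shared vocabulary.
    import torch
    import torch.nn.functional as F

    def upscaled_logprobs(logits_large_base, logits_small_ft, logits_small_base):
        """log p_up is proportional to log p_large_base + (log p_small_ft - log p_small_base)."""
        combined = (F.log_softmax(logits_large_base, dim=-1)
                    + F.log_softmax(logits_small_ft, dim=-1)
                    - F.log_softmax(logits_small_base, dim=-1))
        return F.log_softmax(combined, dim=-1)        # renormalize over the vocabulary

    # Toy usage with random logits standing in for the three models' outputs.
    vocab_size = 32000
    logprobs = upscaled_logprobs(torch.randn(vocab_size),
                                 torch.randn(vocab_size),
                                 torch.randn(vocab_size))
    next_token = torch.argmax(logprobs)               # one greedy decoding step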