Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
- URL: http://arxiv.org/abs/2212.05055v1
- Date: Fri, 9 Dec 2022 18:57:37 GMT
- Title: Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
- Authors: Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme
Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, Neil Houlsby
- Abstract summary: We propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint.
We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models, respectively, significantly outperform their dense counterparts on SuperGLUE and ImageNet.
- Score: 59.39280540478479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large, deep neural networks to convergence can be prohibitively
expensive. As a result, often only a small selection of popular, dense models
are reused across different contexts and tasks. Increasingly, sparsely
activated models, which seek to decouple model size from computation costs, are
becoming an attractive alternative to dense models. Although more efficient in
terms of quality and computation cost, sparse models remain data-hungry and
costly to train from scratch in the large scale regime. In this work, we
propose sparse upcycling -- a simple way to reuse sunk training costs by
initializing a sparsely activated Mixture-of-Experts model from a dense
checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language
models and Vision Transformer Base and Large models, respectively,
significantly outperform their dense counterparts on SuperGLUE and ImageNet,
using only ~50% of the initial dense pretraining sunk cost. The upcycled models
also outperform sparse models trained from scratch on 100% of the initial dense
pretraining computation budget.
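A minimal sketch may help make the upcycling recipe concrete. In the paper, a subset of the dense model's MLP blocks are replaced by Mixture-of-Experts layers whose experts all start as copies of the corresponding dense MLP, while the router has no dense counterpart and is initialized from scratch; all other weights are copied unchanged. The PyTorch code below is an illustrative sketch under assumed module names and hyperparameters (`DenseFFN`, `UpcycledMoE`, `num_experts`, top-`k` routing), not the authors' T5X/JAX implementation.
```python
# Sketch of the upcycling initialization step (illustrative, not the authors' code):
# every expert in the new MoE layer starts as an exact copy of the dense FFN it
# replaces, and the router is freshly (randomly) initialized.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Standard Transformer feed-forward block (the dense checkpoint's MLP)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff)
        self.wo = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.wo(F.relu(self.wi(x)))


class UpcycledMoE(nn.Module):
    """MoE layer whose experts are all initialized from one pretrained dense FFN."""
    def __init__(self, dense_ffn: DenseFFN, num_experts: int = 8, k: int = 2):
        super().__init__()
        d_model = dense_ffn.wi.in_features
        # Each expert is a copy of the dense FFN weights: the core of sparse upcycling.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        # The router has no dense counterpart, so it starts from scratch.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x):  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (tokens, num_experts)
        topk_gates, topk_idx = gates.topk(self.k, dim=-1)  # route each token to k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: replace a subset (e.g. every other) of the pretrained FFN blocks with an
# UpcycledMoE built from that block, then continue training.
dense_block = DenseFFN(d_model=512, d_ff=2048)  # stands in for a pretrained block
moe_block = UpcycledMoE(dense_block, num_experts=8, k=2)
```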
Related papers
- FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models [35.40065954148091]
FINE is a method based on the Learngene framework for initializing downstream networks by leveraging pre-trained models.
It decomposes pre-trained knowledge into the product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as "learngenes".
It consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes.
arXiv Detail & Related papers (2024-09-28T08:57:17Z)
- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
We propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST).
AST transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process (see the sketch after this list).
Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models.
arXiv Detail & Related papers (2024-07-30T06:33:44Z)
- Initializing Models with Larger Ones [76.41561758293055]
We introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model.
Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time (see the sketch after this list).
arXiv Detail & Related papers (2023-11-30T18:58:26Z)
- LEMON: Lossless model expansion [43.40389747029802]
Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance.
We present $\textbf{L}$ossl$\textbf{E}$ss $\textbf{MO}$del Expansio$\textbf{N}$ (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts.
We show that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
arXiv Detail & Related papers (2023-10-12T03:02:41Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
- Transfer training from smaller language model [6.982133308738434]
We present a method to save training time and resource cost by expanding a small, well-trained model into a large model.
We test the target model on several datasets and find that it remains comparable with the source model.
arXiv Detail & Related papers (2021-04-23T02:56:02Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
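As referenced in the Adaptive Sparse Trainer (AST) entry above, the stated idea is to keep pruned weights soft rather than hard-zeroed: masks are re-selected during training and the currently-masked weights receive an extra decay. The sketch below illustrates that general mask-plus-decay pattern for 2:4 semi-structured sparsity; the group size, decay coefficient, re-selection interval, and helper names (`select_2_4_mask`, `apply_masked_decay`) are assumptions for illustration, not the authors' implementation.
```python
# Hedged sketch of mask-plus-decay semi-structured sparse training (not the AST code):
# periodically re-select a 2:4 magnitude mask per weight group and, instead of zeroing
# pruned weights, push them toward zero so the mask can still change later in training.
import torch


def select_2_4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 along the last dim."""
    w = weight.reshape(-1, 4)
    idx = w.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(w, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask.reshape(weight.shape)


def apply_masked_decay(weight: torch.Tensor, mask: torch.Tensor, decay: float = 3e-4):
    """Decay only the currently-masked (pruned) weights; kept weights are untouched."""
    with torch.no_grad():
        weight[~mask] *= (1.0 - decay)


# Illustrative training-step fragment (model, optimizer, and loader assumed to exist):
#   for step, batch in enumerate(loader):
#       if step % 100 == 0:                          # adaptively re-select masks
#           masks = {n: select_2_4_mask(p) for n, p in model.named_parameters()
#                    if p.dim() == 2 and p.shape[-1] % 4 == 0}
#       loss = model(batch).loss
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
#       for n, p in model.named_parameters():
#           if n in masks:
#               apply_masked_decay(p, masks[n])      # soft, reversible pruning pressure
```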
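The "Initializing Models with Larger Ones" entry above describes weight selection: initializing a smaller model from a subset of a pretrained larger model's weights. The sketch below shows one simple way such a selection could be wired up, copying the leading slice of every matching tensor; the slicing rule and the assumption that parameter names line up are illustrative simplifications, not the paper's exact procedure.
```python
# Minimal sketch of weight selection (illustrative slicing rule, not the paper's recipe):
# initialize each tensor of a small model from the leading sub-tensor of the
# same-named tensor in a larger pretrained model.
import torch
import torch.nn as nn


def weight_select(small: nn.Module, large: nn.Module) -> nn.Module:
    """Copy the leading sub-tensor of every matching parameter from `large` into `small`."""
    small_sd, large_sd = small.state_dict(), large.state_dict()
    for name, s_param in small_sd.items():
        if name not in large_sd:
            continue  # parameter exists only in the small model; leave its init alone
        l_param = large_sd[name]
        if s_param.dim() != l_param.dim():
            continue
        # Build a slice like [:d0, :d1, ...] matching the small tensor's shape.
        slices = tuple(slice(0, d) for d in s_param.shape)
        small_sd[name] = l_param[slices].clone()
    small.load_state_dict(small_sd)
    return small


# Usage with hypothetical width-scaled MLPs (parameter names line up by construction):
large_model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
small_model = nn.Sequential(nn.Linear(384, 1536), nn.GELU(), nn.Linear(1536, 384))
weight_select(small_model, large_model)  # small model now starts from selected weights
```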