Train Large, Then Compress: Rethinking Model Size for Efficient Training
and Inference of Transformers
- URL: http://arxiv.org/abs/2002.11794v2
- Date: Tue, 23 Jun 2020 00:23:39 GMT
- Title: Train Large, Then Compress: Rethinking Model Size for Efficient Training
and Inference of Transformers
- Authors: Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan
Klein, Joseph E. Gonzalez
- Abstract summary: We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
- Score: 94.43313684188819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since hardware resources are limited, the objective of training deep learning
models is typically to maximize accuracy subject to the time and memory
constraints of training and inference. We study the impact of model size in
this setting, focusing on Transformer models for NLP tasks that are limited by
compute: self-supervised pretraining and high-resource machine translation. We
first show that even though smaller Transformer models execute faster per
iteration, wider and deeper models converge in significantly fewer steps.
Moreover, this acceleration in convergence typically outpaces the additional
computational overhead of using larger models. Therefore, the most
compute-efficient training strategy is to counterintuitively train extremely
large models but stop after a small number of iterations.
This leads to an apparent trade-off between the training efficiency of large
Transformer models and the inference efficiency of small Transformer models.
However, we show that large models are more robust to compression techniques
such as quantization and pruning than small models. Consequently, one can get
the best of both worlds: heavily compressed, large models achieve higher
accuracy than lightly compressed, small models.
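As a concrete illustration of this recipe, here is a minimal PyTorch sketch (not the authors' released code) of the train-large-then-compress workflow: a wide, deep Transformer encoder is trained for only a small number of steps, then compressed with magnitude pruning and 8-bit dynamic quantization. The model size, step count, objective, and sparsity level are illustrative placeholders.

```python
# Minimal sketch of "train large, then compress" (illustrative; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# 1) Train a deliberately large Transformer for only a few optimizer steps.
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, dim_feedforward=4096)
model = nn.TransformerEncoder(layer, num_layers=24)          # "wider and deeper"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()                                     # placeholder objective

for step in range(100):                                      # stop after few iterations
    x = torch.randn(32, 8, 1024)                             # dummy (seq, batch, d_model) batch
    loss = criterion(model(x), x)                            # placeholder reconstruction target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# 2) Compress the large model for cheap inference.
for module in model.modules():                               # magnitude-prune linear weights
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")                       # make the sparsity permanent

compressed = torch.quantization.quantize_dynamic(            # 8-bit dynamic quantization
    model, {nn.Linear}, dtype=torch.qint8
)
```

Per the abstract, the heavily compressed large model is the configuration to prefer at a fixed inference budget; the sketch above only shows the mechanics of the workflow, not those results.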
Related papers
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families (a rough curve-fitting sketch appears after this list).
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- Model Compression and Efficient Inference for Large Language Models: A Survey [20.199282252344396]
Large language models have two prominent characteristics compared to smaller models.
The most notable aspect of large models is the very high cost associated with model finetuning or training.
Large models emphasize versatility and generalization rather than performance on a single task.
arXiv Detail & Related papers (2024-02-15T06:58:30Z)
- Weight subcloning: direct initialization of transformers using larger pretrained ones [42.056148990349094]
We introduce a technique to transfer the knowledge of a pretrained model to smaller variants.
Weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models.
We achieve 4x faster training for vision transformers in image classification and language models designed for next token prediction.
arXiv Detail & Related papers (2023-12-14T19:08:56Z)
- Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z)
- Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding [12.030179065286928]
The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated on two natural language understanding tasks, showing up to a $63\times$ compression ratio with little accuracy loss and substantial inference and training speedups.
arXiv Detail & Related papers (2023-06-01T18:32:08Z)
- Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles of smaller models can outperform single large models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This suggests that output diversity in ensembling can often be a more efficient use of compute than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
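As a loose illustration of the scaling-law estimation referenced in the first related paper above (not that paper's functional form, data, or code), the sketch below fits a saturating power law of loss versus parameter count to a few small models and extrapolates to a larger target; every number is an invented placeholder.

```python
# Hypothetical scaling-law fit: L(N) = a * N**(-alpha) + c, fit on small models,
# then extrapolated to a larger model. All data points are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, alpha, c):
    """Saturating power law for loss as a function of parameter count."""
    return a * n_params ** (-alpha) + c

# Losses measured on a family of small, cheap-to-train models (invented numbers).
n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.10, 3.72, 3.41, 3.18, 3.01])

(a, alpha, c), _ = curve_fit(power_law, n_params, losses, p0=(10.0, 0.1, 2.0))

# Extrapolate to a 10B-parameter model that was never trained.
predicted = power_law(1e10, a, alpha, c)
print(f"fitted exponent alpha = {alpha:.3f}, predicted loss at 10B params = {predicted:.2f}")
```

The same extrapolation can be done over training-set size instead of parameter count, in line with the entry's note that scaling laws extrapolate from models with fewer parameters or smaller training sets.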
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.