MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
- URL: http://arxiv.org/abs/2312.17482v2
- Date: Tue, 16 Jan 2024 16:03:31 GMT
- Title: MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining
- Authors: Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla,
Moin Nadeem, Nikhil Sardana, Daya Khudia, Jonathan Frankle
- Abstract summary: We introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining.
When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20.
This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models.
- Score: 10.421048804389343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although BERT-style encoder models are heavily used in NLP research, many
researchers do not pretrain their own BERTs from scratch due to the high cost
of training. In the past half-decade since BERT first rose to prominence, many
advances have been made with other transformer architectures and training
configurations that have yet to be systematically incorporated into BERT. Here,
we introduce MosaicBERT, a BERT-style encoder architecture and training recipe
that is empirically optimized for fast pretraining. This efficient architecture
incorporates FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear
Units (GLU), a module to dynamically remove padded tokens, and low precision
LayerNorm into the classic transformer encoder block. The training recipe
includes a 30% masking ratio for the Masked Language Modeling (MLM) objective,
bfloat16 precision, and vocabulary size optimized for GPU throughput, in
addition to best-practices from RoBERTa and other encoder models. When
pretrained from scratch on the C4 dataset, this base model achieves a
downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs
at a cost of roughly $20. We plot extensive accuracy vs. pretraining speed
Pareto curves and show that MosaicBERT base and large are consistently Pareto
optimal when compared to a competitive BERT base and large. This empirical
speedup in pretraining enables researchers and engineers to pretrain custom
BERT-style models at low cost instead of finetuning existing generic models.
We open source our model weights and code.
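For orientation, the sketch below illustrates two of the architectural components named in the abstract: symmetric ALiBi attention biases (the bidirectional-encoder form of Attention with Linear Biases) and a Gated Linear Unit feedforward block. It is a minimal PyTorch sketch under assumed shapes, names, and defaults, not the released MosaicBERT implementation.

```python
# Minimal sketch (not the MosaicBERT source) of symmetric ALiBi biases and a
# GLU-style feedforward block; head counts, dimensions, and names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Per-head slopes follow the geometric sequence from the ALiBi paper,
    # 2^(-8/H), 2^(-16/H), ..., assuming the head count is a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()  # symmetric |i - j| for an encoder
    # Shape (H, L, L); added to the attention logits before the softmax.
    return -slopes[:, None, None] * dist

class GLUFeedForward(nn.Module):
    """Gated Linear Unit FFN: a second projection of the input gates the hidden activation."""
    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff)
        self.value = nn.Linear(d_model, d_ff)
        self.out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU-gated variant: out = W_out(gelu(W_gate x) * (W_value x))
        return self.out(F.gelu(self.gate(x)) * self.value(x))
```

In a full encoder block the bias tensor would be added to the scaled query-key logits in every self-attention layer, which is what removes the need for learned positional embeddings.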
Related papers
- BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining [0.5919433278490629]
BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks.
DeBERTa introduced an enhanced decoder adapted to BERT's encoder for pretraining, which proved highly effective.
We argue that the design and research around enhanced masked language modeling decoders have been underappreciated.
arXiv Detail & Related papers (2024-01-29T03:25:11Z)
- Asymmetric Masked Distillation for Pre-Training Small Foundation Models [52.56257450614992]
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding.
This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks.
We propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding.
arXiv Detail & Related papers (2023-11-06T14:44:34Z)
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model so that the self-attention queries and feedforward layers operate only on the masked tokens of each sentence during pretraining (a rough sketch of this idea appears after this list).
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
arXiv Detail & Related papers (2023-01-11T23:45:50Z)
- Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
arXiv Detail & Related papers (2022-12-20T18:50:08Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Prune Once for All: Sparse Pre-Trained Language Models [0.6063525456640462]
We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation.
These sparse pre-trained models can be used for transfer learning to a wide range of tasks while maintaining their sparsity pattern.
We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.
arXiv Detail & Related papers (2021-11-10T15:52:40Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT relies heavily on the global self-attention block and thus suffers a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed-attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z)
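As a rough illustration of the NarrowBERT idea summarized above (running an expensive sublayer only on the masked positions during MLM pretraining), the following is a minimal PyTorch sketch; the module name, shapes, and defaults are assumptions rather than the paper's code.

```python
# Hypothetical sketch of a "narrow" feedforward sublayer that only processes
# masked positions during MLM pretraining; not NarrowBERT's actual code.
import torch
import torch.nn as nn

class NarrowFeedForward(nn.Module):
    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, hidden: torch.Tensor, masked_idx: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model); masked_idx: (batch, n_masked)
        # Gather only the [MASK] positions so the FFN cost scales with the
        # number of masked tokens (a fraction of the sequence) rather than seq_len.
        idx = masked_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        narrow = torch.gather(hidden, dim=1, index=idx)   # (batch, n_masked, d_model)
        # Residual connection over the masked positions only; an MLM head
        # then predicts tokens just for these positions.
        return narrow + self.ffn(narrow)
```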