NarrowBERT: Accelerating Masked Language Model Pretraining and Inference
- URL: http://arxiv.org/abs/2301.04761v2
- Date: Mon, 5 Jun 2023 23:47:43 GMT
- Title: NarrowBERT: Accelerating Masked Language Model Pretraining and Inference
- Authors: Haoxin Li, Phillip Keung, Daniel Cheng, Jungo Kasai, Noah A. Smith
- Abstract summary: We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
- Score: 50.59811343945605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale language model pretraining is a very successful form of
self-supervised learning in natural language processing, but it is increasingly
expensive to perform as the models and pretraining corpora have become larger
over time. We propose NarrowBERT, a modified transformer encoder that increases
the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention
queries and feedforward layers only operate on the masked tokens of each
sentence during pretraining, rather than all of the tokens as with the usual
transformer encoder. We also show that NarrowBERT increases the throughput at
inference time by as much as $3.5\times$ with minimal (or no) performance
degradation on sentence encoding tasks like MNLI. Finally, we examine the
performance of NarrowBERT on the IMDB and Amazon reviews classification and
CoNLL NER tasks and show that it is also comparable to standard BERT
performance.
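As a rough illustration of the narrowing idea described in the abstract, the sketch below applies the self-attention queries and the feedforward block only at the masked positions, while keys and values still come from the full sequence. This is a minimal sketch under assumed dimensions and layer structure, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class NarrowedEncoderLayer(nn.Module):
    """Minimal sketch of a 'narrowed' transformer layer: attention queries and
    the feedforward block run only on the masked positions, while keys and
    values still come from the full sequence. Sizes are illustrative."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, masked_idx: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) full-sequence states from earlier layers
        # masked_idx: (batch, n_masked) positions of the [MASK] tokens
        narrow = torch.gather(
            hidden, 1, masked_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        )
        # Queries come only from the masked positions; keys/values see the full
        # sentence, so the masked tokens still attend to all of their context.
        attn_out, _ = self.attn(query=narrow, key=hidden, value=hidden)
        narrow = self.norm1(narrow + attn_out)
        # The feedforward block likewise runs only on the masked positions.
        narrow = self.norm2(narrow + self.ff(narrow))
        return narrow  # (batch, n_masked, d_model), fed to the MLM prediction head

# With ~15% of tokens masked, the expensive blocks process far fewer positions,
# which is where the pretraining throughput gain comes from.
layer = NarrowedEncoderLayer()
hidden = torch.randn(2, 128, 768)            # two sentences of 128 tokens
masked_idx = torch.randint(0, 128, (2, 20))  # 20 masked positions per sentence
out = layer(hidden, masked_idx)              # shape: (2, 20, 768)
```

The inference-time speedup on sentence encoding tasks like MNLI presumably comes from the same kind of narrowing, e.g. computing these blocks only for the positions the downstream task actually reads (such as the [CLS] token); see the paper for the exact configurations.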
Related papers
- BPDec: Unveiling the Potential of Masked Language Modeling Decoder in BERT pretraining [0.5919433278490629]
BERT (Bidirectional Encoder Representations from Transformers) has revolutionized the field of natural language processing through its exceptional performance on numerous tasks.
DeBERTa introduced an enhanced decoder adapted to BERT's encoder for pretraining, which proved highly effective.
We argue that the design and research around enhanced masked language modeling decoders have been underappreciated.
arXiv Detail & Related papers (2024-01-29T03:25:11Z)
- MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining [10.421048804389343]
We introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining.
When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20.
This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models.
arXiv Detail & Related papers (2023-12-29T06:05:19Z)
- DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks [33.558503823505056]
In this work, we focus on improving the position encoding ability of BERT with causal attention masks.
We propose a new pre-trained language model DecBERT and evaluate it on the GLUE benchmark.
Experimental results show that (1) the causal attention mask is effective for BERT on language understanding tasks; (2) our DecBERT model without position embeddings achieves comparable performance on the GLUE benchmark; and (3) our modification accelerates the pre-training process, and DecBERT achieves better overall performance than the baseline systems.
arXiv Detail & Related papers (2022-04-19T06:12:48Z)
- Universal Conditional Masked Language Pre-training for Neural Machine Translation [29.334361879066602]
We propose CeMAT, a conditional masked language model pre-trained on large-scale bilingual and monolingual corpora.
We conduct extensive experiments and show that CeMAT achieves significant performance improvements in all scenarios.
arXiv Detail & Related papers (2022-03-17T10:00:33Z)
- Improving language models by retrieving from trillions of tokens [50.42630445476544]
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus.
With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile.
arXiv Detail & Related papers (2021-12-08T17:32:34Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, which attaches early-exit classifiers to intermediate layers to accelerate BERT inference (a minimal sketch of the early-exit idea appears after this list).
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- lamBERT: Language and Action Learning Using Multimodal BERT [0.1942428068361014]
This study proposes lamBERT, a model for language and action learning using multimodal BERT.
Experiments are conducted in a grid environment that requires language understanding for the agent to act properly.
The lamBERT model obtained higher rewards in multitask settings and transfer settings when compared to other models.
arXiv Detail & Related papers (2020-04-15T13:54:55Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
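The DeeBERT entry above accelerates inference by exiting the layer stack early once an intermediate classifier is confident enough. The following is a toy sketch of that early-exit idea, not the released DeeBERT code; the entropy criterion, threshold value, and classification from the first position are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy illustration of dynamic early exiting: an internal classifier
    ("off-ramp") after each layer lets confident examples stop early."""

    def __init__(self, n_layers: int = 12, d_model: int = 768, n_classes: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=12, dim_feedforward=3072,
                                       batch_first=True)
            for _ in range(n_layers)
        )
        self.off_ramps = nn.ModuleList(nn.Linear(d_model, n_classes) for _ in range(n_layers))

    @torch.no_grad()
    def forward(self, hidden: torch.Tensor, entropy_threshold: float = 0.3):
        # hidden: (1, seq_len, d_model) -- one example at a time, since the
        # exit decision is made per example at inference.
        for used, (layer, ramp) in enumerate(zip(self.layers, self.off_ramps), start=1):
            hidden = layer(hidden)
            logits = ramp(hidden[:, 0])                    # predict from the first position
            probs = torch.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
            if entropy.item() < entropy_threshold:         # confident enough: exit early
                break
        return logits, used                                # prediction and layers actually run

# Easy examples exit after a few layers; harder ones fall through to the top.
model = EarlyExitEncoder()
logits, layers_used = model(torch.randn(1, 64, 768))
```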
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.