Token Dropping for Efficient BERT Pretraining
- URL: http://arxiv.org/abs/2203.13240v1
- Date: Thu, 24 Mar 2022 17:50:46 GMT
- Title: Token Dropping for Efficient BERT Pretraining
- Authors: Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song,
Xiaodan Song, Denny Zhou
- Abstract summary: We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead.
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
- Score: 33.63507016806947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models generally allocate the same amount of computation
for each token in a given sequence. We develop a simple but effective "token
dropping" method to accelerate the pretraining of transformer models, such as
BERT, without degrading its performance on downstream tasks. In short, we drop
unimportant tokens starting from an intermediate layer in the model to make the
model focus on important tokens; the dropped tokens are later picked up by the
last layer of the model so that the model still produces full-length sequences.
We leverage the already built-in masked language modeling (MLM) loss to
identify unimportant tokens with practically no computational overhead. In our
experiments, this simple approach reduces the pretraining cost of BERT by 25%
while achieving similar overall fine-tuning performance on standard downstream
tasks.
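As a rough illustration of the mechanism described in the abstract, the sketch below drops low-importance tokens after the early layers and restores them before the last layer so the output is still full length. The layer split, the keep ratio, and the use of a per-token importance score (e.g., a running MLM loss) are illustrative assumptions, not the authors' released implementation.

```python
import torch

def token_dropping_forward(embeddings, importance, layers, keep_ratio=0.5,
                           full_layers_front=6, full_layers_back=1):
    """Minimal token-dropping sketch. embeddings: [B, L, D]; importance: [B, L]
    (e.g., a running per-token MLM loss); layers: list of callables mapping
    [B, *, D] -> [B, *, D]. The front/middle/back split and keep_ratio are
    illustrative assumptions."""
    h = embeddings
    # 1) Early layers process the full sequence.
    for layer in layers[:full_layers_front]:
        h = layer(h)

    # 2) Keep only the most "important" tokens for the middle layers.
    B, L, D = h.shape
    k = max(1, int(L * keep_ratio))
    keep_idx = importance.topk(k, dim=1).indices                      # [B, k]
    kept = torch.gather(h, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    for layer in layers[full_layers_front:len(layers) - full_layers_back]:
        kept = layer(kept)

    # 3) Scatter the updated tokens back into the full-length sequence so the
    #    last layer(s) see every position and produce full-length outputs.
    h = h.scatter(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), kept)
    for layer in layers[len(layers) - full_layers_back:]:
        h = layer(h)
    return h
```

Because the importance signal is derived from the MLM loss that pretraining already computes, selecting which tokens to drop adds essentially no extra computation.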
Related papers
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
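A minimal sketch of the patch-level training idea above, assuming patches are formed by averaging K consecutive token embeddings and that the placeholder `model` maps patch embeddings to same-shaped hidden states; the paper's exact patch construction and prediction loss may differ.

```python
import torch
import torch.nn.functional as F

def to_patches(token_embeddings, patch_size=4):
    """Group every `patch_size` consecutive token embeddings into one patch by
    averaging (an illustrative choice). token_embeddings: [B, L, D], with L
    divisible by patch_size."""
    B, L, D = token_embeddings.shape
    return token_embeddings.view(B, L // patch_size, patch_size, D).mean(dim=2)

def next_patch_loss(model, token_embeddings, patch_size=4):
    """Patch-level training sketch: feed the shortened patch sequence and
    regress each prefix position onto the following patch. Assumes `model`
    maps [B, P-1, D] patch embeddings to hidden states of the same shape."""
    patches = to_patches(token_embeddings, patch_size)   # [B, P, D]
    hidden = model(patches[:, :-1])                      # predictions from the prefix
    target = patches[:, 1:]                              # "next patch" targets
    return F.mse_loss(hidden, target)
```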
- Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining.
Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z)
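The pruning-plus-combining idea above can be sketched as follows; the scoring function and the choice to fuse pruned tokens into a single mean-pooled token are assumptions, not the paper's exact method.

```python
import torch

def prune_and_combine(hidden, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens and fuse the pruned ones into a single
    summary token. hidden: [B, L, D]; scores: [B, L] (e.g., attention received
    per token, an assumed choice)."""
    B, L, D = hidden.shape
    k = max(1, int(L * keep_ratio))
    order = scores.argsort(dim=1, descending=True)
    keep_idx, drop_idx = order[:, :k], order[:, k:]
    kept = torch.gather(hidden, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    dropped = torch.gather(hidden, 1, drop_idx.unsqueeze(-1).expand(-1, -1, D))
    fused = dropped.mean(dim=1, keepdim=True)             # [B, 1, D]
    return torch.cat([kept, fused], dim=1)                # [B, k + 1, D]
```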
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
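A rough sketch of mixing the three prediction objectives named above; uniform per-step sampling of the objective and the generic token interface are assumptions (TokenUnify itself targets visual tokens, and its scheduling/weighting is not specified here).

```python
import random
import torch
import torch.nn.functional as F

def mixed_token_prediction_loss(model, tokens):
    """tokens: [B, L] integer ids; model: [B, L] -> logits [B, L, V]."""
    logits = model(tokens)
    task = random.choice(["random", "next", "next_all"])  # assumed scheduling
    if task == "random":
        # Predict the tokens at randomly selected positions.
        mask = torch.rand(tokens.shape, device=tokens.device) < 0.15
        return F.cross_entropy(logits[mask], tokens[mask])
    if task == "next":
        # Standard next-token prediction.
        return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
    # "next_all": every position is scored against all of its future tokens
    # (a deliberately simple O(L^2) formulation for illustration).
    log_probs = F.log_softmax(logits, dim=-1)
    loss, steps = 0.0, 0
    for t in range(tokens.size(1) - 1):
        future = tokens[:, t + 1:]                          # [B, L - t - 1]
        loss = loss - log_probs[:, t].gather(1, future).mean()
        steps += 1
    return loss / steps
```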
- Unlocking the Transferability of Tokens in Deep Models for Tabular Data [67.11727608815636]
Fine-tuning a pre-trained deep neural network has become a successful paradigm in various machine learning tasks.
In this paper, we propose TabToken, a method that aims to enhance the quality of feature tokens.
We introduce a contrastive objective that regularizes the tokens, capturing the semantics within and across features.
arXiv Detail & Related papers (2023-10-23T17:53:09Z)
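One way to read "a contrastive objective that regularizes the tokens" is a supervised-contrastive loss over feature tokens; the sketch below is such an assumption-laden illustration, with positives defined by a hypothetical `feature_ids` grouping rather than TabToken's actual construction.

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(tokens, feature_ids, temperature=0.1):
    """tokens: [N, D] feature-token embeddings; feature_ids: [N] integer labels.
    Tokens sharing a label are treated as positives, all others as negatives."""
    n = tokens.size(0)
    z = F.normalize(tokens, dim=-1)
    eye = torch.eye(n, dtype=torch.bool, device=tokens.device)
    sim = (z @ z.t()) / temperature
    sim = sim.masked_fill(eye, float("-inf"))        # drop self-similarity
    pos = (feature_ids.unsqueeze(0) == feature_ids.unsqueeze(1)) & ~eye
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Average negative log-probability over all positive pairs.
    return -log_prob[pos].mean()
```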
- Revisiting Token Dropping Strategy in Efficient BERT Pretraining [102.24112230802011]
Token dropping is a strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers.
However, we empirically find that token dropping is prone to a semantic loss problem and falls short in handling semantic-intense tasks.
Motivated by this, we propose a simple yet effective semantic-consistent learning method (ScTD) to improve token dropping.
arXiv Detail & Related papers (2023-05-24T15:59:44Z)
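The abstract does not spell out how ScTD enforces semantic consistency; one hedged reading is a consistency term that aligns the token-dropped forward pass with a full-sequence pass, sketched below purely as an illustration.

```python
import torch.nn.functional as F

def semantic_consistency_loss(full_hidden, dropped_hidden):
    """Align a crude sentence representation of the token-dropped forward pass
    with that of a full-sequence pass (treated as the teacher). Whether ScTD
    uses this particular alignment is an assumption.
    full_hidden: [B, L, D]; dropped_hidden: [B, L', D]."""
    full_repr = full_hidden.mean(dim=1).detach()     # stop-gradient teacher
    dropped_repr = dropped_hidden.mean(dim=1)
    return 1.0 - F.cosine_similarity(dropped_repr, full_repr, dim=-1).mean()
```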
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
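A minimal top-1-routed mixture-of-experts feed-forward block illustrates the general idea: only one expert runs per token, so capacity grows without increasing per-token compute. MoEBERT's importance-guided conversion of the pretrained FFN into experts, its routing details, and its distillation objective are not reproduced here.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Generic top-1-routed MoE feed-forward sketch (not MoEBERT's method)."""
    def __init__(self, d_model=768, d_ff=1024, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                         # x: [B, L, d_model]
        gate = self.router(x).softmax(dim=-1)     # routing weights [B, L, E]
        top_w, top_i = gate.max(dim=-1)           # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            routed = top_i == e                   # tokens assigned to expert e
            if routed.any():
                out[routed] = expert(x[routed]) * top_w[routed].unsqueeze(-1)
        return out
```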
- Position Masking for Language Models [0.0]
Masked language modeling (MLM) pre-training models such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose to expand upon this idea by masking the positions of some tokens along with the masked input token ids.
arXiv Detail & Related papers (2020-06-02T23:40:41Z)
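A hedged sketch of position masking on top of standard MLM masking: some position ids are replaced by a reserved "masked position" id (here a hypothetical `mask_position_id`) and recovered alongside the masked tokens; the paper's exact masking scheme may differ.

```python
import torch

def mask_tokens_and_positions(input_ids, position_ids, mask_token_id,
                              mask_position_id, prob=0.15):
    """Mask some token ids (standard MLM) and, independently, some position
    ids, so the model must also recover where tokens sit in the sequence.
    Shapes: [B, L] for all inputs."""
    device = input_ids.device
    token_mask = torch.rand(input_ids.shape, device=device) < prob
    pos_mask = torch.rand(position_ids.shape, device=device) < prob
    masked_ids = input_ids.masked_fill(token_mask, mask_token_id)
    masked_pos = position_ids.masked_fill(pos_mask, mask_position_id)
    # Labels keep the original values at masked locations and -100 (ignored
    # by cross-entropy) everywhere else.
    token_labels = input_ids.masked_fill(~token_mask, -100)
    pos_labels = position_ids.masked_fill(~pos_mask, -100)
    return masked_ids, masked_pos, token_labels, pos_labels
```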
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
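Replaced token detection can be sketched as follows: a small generator fills in masked positions and the discriminator classifies every token as original or replaced. Embedding sharing, the generator's own MLM loss, sampling instead of argmax, and ELECTRA's loss weighting are omitted; the `generator`/`discriminator` callables and their output shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def replaced_token_detection_loss(generator, discriminator, input_ids,
                                  mask_token_id, mask_prob=0.15):
    """Assumes generator(ids) returns MLM logits [B, L, V] and
    discriminator(ids) returns per-token logits [B, L, 1]. input_ids: [B, L]."""
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    corrupted = input_ids.masked_fill(mask, mask_token_id)

    # The generator proposes plausible replacements for the masked positions
    # (greedy argmax here; ELECTRA samples from the generator's distribution).
    with torch.no_grad():
        sampled = generator(corrupted).argmax(dim=-1)
    filled = torch.where(mask, sampled, input_ids)

    # The discriminator labels every token: 1 = replaced, 0 = original.
    labels = (filled != input_ids).float()
    logits = discriminator(filled).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, labels)
```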