Revisiting Token Dropping Strategy in Efficient BERT Pretraining
- URL: http://arxiv.org/abs/2305.15273v1
- Date: Wed, 24 May 2023 15:59:44 GMT
- Title: Revisiting Token Dropping Strategy in Efficient BERT Pretraining
- Authors: Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du and Dacheng Tao
- Abstract summary: Token dropping is a strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers.
However, we empirically find that token dropping is prone to a semantic loss problem and falls short in handling semantic-intense tasks.
Motivated by this, we propose a simple yet effective semantic-consistent learning method (ScTD) to improve the token dropping.
- Score: 102.24112230802011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Token dropping is a recently-proposed strategy to speed up the pretraining of
masked language models, such as BERT, by skipping the computation of a subset
of the input tokens at several middle layers. It can effectively reduce the
training time without degrading much performance on downstream tasks. However,
we empirically find that token dropping is prone to a semantic loss problem and
falls short in handling semantic-intense tasks. Motivated by this, we propose a
simple yet effective semantic-consistent learning method (ScTD) to improve the
token dropping. ScTD aims to encourage the model to learn how to preserve the
semantic information in the representation space. Extensive experiments on 12
tasks show that, with the help of our ScTD, token dropping can achieve
consistent and significant performance gains across all task types and model
sizes. More encouragingly, ScTD saves up to 57% of pretraining time and brings
up to +1.56% average improvement over the vanilla token dropping.
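The core mechanism the abstract describes (skipping the middle-layer computation for a subset of token positions) can be sketched in a few lines. The snippet below is a minimal illustration under assumptions, not the authors' released implementation: the helper names (`drop_tokens`, `restore_tokens`), the `importance` scores, and the `keep_ratio` value are invented for this example, and the ScTD semantic-consistency objective itself is not modeled here.

```python
# Hedged sketch of token dropping for a BERT-style encoder (PyTorch).
# Assumption: the encoder is split into lower / middle / upper layer groups,
# and a per-position importance score is available (e.g., from the MLM loss).
import torch


def drop_tokens(hidden, importance, keep_ratio=0.5):
    """Select the most important token positions before the middle layers.

    hidden:     (batch, seq_len, dim) representations after the lower layers
    importance: (batch, seq_len) scores; higher means "keep this token"
    Returns the reduced hidden states and the kept indices (needed to restore
    the full sequence later).
    """
    batch, seq_len, dim = hidden.shape
    n_keep = max(1, int(seq_len * keep_ratio))
    # Top-n_keep positions per sequence, re-sorted to preserve token order.
    keep_idx = importance.topk(n_keep, dim=1).indices.sort(dim=1).values
    kept = hidden.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return kept, keep_idx


def restore_tokens(full_hidden, kept, keep_idx):
    """Scatter the middle-layer outputs for the kept tokens back into the full
    sequence before the upper layers; dropped positions keep their pre-drop
    representations."""
    dim = full_hidden.size(-1)
    return full_hidden.scatter(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim), kept)
```

The saving comes from running the expensive middle transformer layers on `kept` only, while the upper layers see the restored full-length sequence. Per the abstract, ScTD additionally encourages the model to preserve semantic information in the representation space, which this sketch does not attempt to model.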
Related papers
- Unlocking the Transferability of Tokens in Deep Models for Tabular Data [67.11727608815636]
Fine-tuning a pre-trained deep neural network has become a successful paradigm in various machine learning tasks.
In this paper, we propose TabToken, a method that aims to enhance the quality of feature tokens.
We introduce a contrastive objective that regularizes the tokens, capturing the semantics within and across features.
arXiv Detail & Related papers (2023-10-23T17:53:09Z)
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose a vision token pruning and merging method ELIP, to remove less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that, with 30% of vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with minimal additional cost.
Our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- Token Dropping for Efficient BERT Pretraining [33.63507016806947]
We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead (a hedged sketch of this scoring idea appears after this list).
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
arXiv Detail & Related papers (2022-03-24T17:50:46Z)
- AdapLeR: Speeding up Inference by Adaptive Length Reduction [15.57872065467772]
We propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance.
Our method dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost.
Our experiments on several diverse classification tasks show speedups of up to 22x at inference time with little sacrifice in performance.
arXiv Detail & Related papers (2022-03-16T23:41:38Z)
- A Simple Baseline for Semi-supervised Semantic Segmentation with Strong Data Augmentation [74.8791451327354]
We propose a simple yet effective semi-supervised learning framework for semantic segmentation.
A set of simple design and training techniques can collectively improve the performance of semi-supervised semantic segmentation significantly.
Our method achieves state-of-the-art results in the semi-supervised settings on the Cityscapes and Pascal VOC datasets.
arXiv Detail & Related papers (2021-04-15T06:01:39Z)
- MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z)
- Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performances on Cityscapes, CamVid and KITTI datasets.
arXiv Detail & Related papers (2020-04-30T17:09:17Z)
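As noted in the "Token Dropping for Efficient BERT Pretraining" entry above, the importance signal can come from the MLM loss that pretraining already computes. The sketch below is a hedged illustration of that idea under assumptions: it keeps an exponential moving average of the MLM loss per vocabulary id and uses it to score positions. The class name `TokenImportanceTracker` and the `ema_decay` value are invented for this example; the resulting scores could feed the `importance` argument of the `drop_tokens` sketch above.

```python
# Hedged sketch: rank token positions by how hard their vocabulary ids are to
# predict under MLM, using statistics the training loop already produces.
import torch


class TokenImportanceTracker:
    def __init__(self, vocab_size, ema_decay=0.9):
        self.ema_decay = ema_decay
        # Running estimate of the MLM loss for each vocabulary id; ids that are
        # easy to predict end up with low scores and are candidates for dropping.
        self.loss_ema = torch.zeros(vocab_size)

    def update(self, token_ids, per_token_loss):
        """token_ids, per_token_loss: 1-D tensors over the masked positions of a batch."""
        for tid, loss in zip(token_ids.tolist(), per_token_loss.tolist()):
            self.loss_ema[tid] = (self.ema_decay * self.loss_ema[tid]
                                  + (1.0 - self.ema_decay) * loss)

    def importance(self, input_ids):
        """Map a (batch, seq_len) tensor of input ids to per-position scores."""
        return self.loss_ema[input_ids]
```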