MC-BERT: Efficient Language Pre-Training via a Meta Controller
- URL: http://arxiv.org/abs/2006.05744v2
- Date: Tue, 16 Jun 2020 09:15:08 GMT
- Title: MC-BERT: Efficient Language Pre-Training via a Meta Controller
- Authors: Zhenhui Xu, Linyuan Gong, Guolin Ke, Di He, Shuxin Zheng, Liwei Wang,
Jiang Bian, Tie-Yan Liu
- Abstract summary: Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
- Score: 96.68140474547602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained contextual representations (e.g., BERT) have become the
foundation to achieve state-of-the-art results on many NLP tasks. However,
large-scale pre-training is computationally expensive. ELECTRA, an early
attempt to accelerate pre-training, trains a discriminative model that predicts
whether each input token was replaced by a generator. Our studies reveal that
ELECTRA's success is mainly due to its reduced complexity of the pre-training
task: the binary classification (replaced token detection) is more efficient to
learn than the generation task (masked language modeling). However, such a
simplified task is less semantically informative. To achieve better efficiency
and effectiveness, we propose a novel meta-learning framework, MC-BERT. The
pre-training task is a multi-choice cloze test with a reject option, where a
meta controller network provides training input and candidates. Results on the
GLUE natural language understanding benchmark demonstrate that our proposed
method is both efficient and effective: it outperforms baselines on GLUE
semantic tasks given the same computational budget.
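The abstract describes the pre-training task only at a high level. As an illustrative aid, the snippet below shows one way a per-position multi-choice cloze loss with a reject option could be wired up in PyTorch. Every name in it (e.g. `MultiChoiceClozeHead`, `candidate_ids`, the shared `embedding`) is an assumption made for illustration, not the authors' released code, and the meta controller that produces the corrupted input and the candidate sets is treated as given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiChoiceClozeHead(nn.Module):
    """Hypothetical head for a multi-choice cloze task with a reject option.

    For every position, the main model scores k candidate tokens (assumed to be
    supplied by a meta-controller-like network) plus one extra "reject" option,
    read here as "the observed token is already the original one". This is an
    illustrative sketch under those assumptions, not the paper's implementation.
    """

    def __init__(self, hidden_size: int, embedding: nn.Embedding):
        super().__init__()
        self.embedding = embedding                              # token embeddings shared with the encoder
        self.proj = nn.Linear(hidden_size, hidden_size)         # maps hidden states into the scoring space
        self.reject = nn.Parameter(torch.zeros(hidden_size))    # learned vector scoring the reject option

    def forward(self, hidden, candidate_ids, labels):
        # hidden:        (batch, seq_len, hidden)  contextual states of the main model
        # candidate_ids: (batch, seq_len, k)       candidate tokens proposed per position
        # labels:        (batch, seq_len)          gold index in [0, k]; 0 means "reject"
        h = self.proj(hidden)
        cand_emb = self.embedding(candidate_ids)                         # (B, L, k, H)
        cand_logits = torch.einsum("blh,blkh->blk", h, cand_emb)         # dot-product candidate scores
        reject_logit = (h * self.reject).sum(dim=-1, keepdim=True)       # (B, L, 1)
        logits = torch.cat([reject_logit, cand_logits], dim=-1)          # (B, L, k + 1)
        return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```

On this reading, choosing the reject option amounts to declaring the observed token correct, which loosely mirrors ELECTRA's binary replaced-token detection while keeping a richer, candidate-level signal.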
Related papers
- Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained
Models [43.7024573212373]
We adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks.
Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead.
arXiv Detail & Related papers (2022-05-30T16:32:30Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline that adds no extra computational cost.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- Training ELECTRA Augmented with Multi-word Selection [53.77046731238381]
We present a new text encoder pre-training method that improves ELECTRA based on multi-task learning.
Specifically, we train the discriminator to simultaneously detect replaced tokens and select original tokens from candidate sets (a rough sketch of this joint objective is given after this list).
arXiv Detail & Related papers (2021-05-31T23:19:00Z)
- Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks [40.97125791174191]
We propose a self-supervised approach to generate a large, rich, meta-learning task distribution from unlabeled text.
We show that this meta-training leads to better few-shot generalization than language-model pre-training followed by finetuning.
arXiv Detail & Related papers (2020-09-17T17:53:59Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
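For the "Training ELECTRA Augmented with Multi-word Selection" entry above, the joint objective it describes can be pictured as a weighted sum of a binary replaced-token-detection loss and a per-position softmax over a candidate set, structurally close to the multi-choice sketch earlier on this page. The snippet below is only an assumed wiring of such a multi-task head (names like `rtd_labels`, `candidate_ids`, `selection_labels`, and the weight `alpha` are hypothetical), not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointRTDSelectionHead(nn.Module):
    """Illustrative multi-task head: ELECTRA-style replaced-token detection
    combined with selecting the original token from a per-position candidate set."""

    def __init__(self, hidden_size: int, embedding: nn.Embedding):
        super().__init__()
        self.rtd_classifier = nn.Linear(hidden_size, 1)   # binary: was this token replaced?
        self.embedding = embedding                        # token embeddings shared with the encoder
        self.proj = nn.Linear(hidden_size, hidden_size)   # projection used for candidate scoring

    def forward(self, hidden, rtd_labels, candidate_ids, selection_labels, alpha=1.0):
        # hidden:           (B, L, H)  contextual states of the discriminator
        # rtd_labels:       (B, L)     1 where the input token was replaced, else 0
        # candidate_ids:    (B, L, k)  candidate original tokens for each position
        # selection_labels: (B, L)     index of the true original token within the candidates
        rtd_logits = self.rtd_classifier(hidden).squeeze(-1)
        rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, rtd_labels.float())

        cand_emb = self.embedding(candidate_ids)                                   # (B, L, k, H)
        sel_logits = torch.einsum("blh,blkh->blk", self.proj(hidden), cand_emb)    # candidate scores
        sel_loss = F.cross_entropy(sel_logits.flatten(0, 1), selection_labels.flatten())

        return rtd_loss + alpha * sel_loss  # alpha: assumed weighting between the two tasks
```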
This list is automatically generated from the titles and abstracts of the papers on this site.