ELECTRA is a Zero-Shot Learner, Too
- URL: http://arxiv.org/abs/2207.08141v2
- Date: Wed, 20 Jul 2022 07:55:59 GMT
- Title: ELECTRA is a Zero-Shot Learner, Too
- Authors: Shiwen Ni and Hung-Yu Kao
- Abstract summary: "Pre-train, prompt, and predict" has achieved remarkable achievements compared with the "pre-train, fine-tune" paradigm.
In this paper, we propose a novel replaced token detection (RTD)-based prompt learning method.
Experimental results show that the ELECTRA model based on RTD-prompt learning achieves surprisingly strong, state-of-the-art zero-shot performance.
- Score: 14.315501760755609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, for few-shot or even zero-shot learning, the new paradigm
"pre-train, prompt, and predict" has achieved remarkable results compared
with the "pre-train, fine-tune" paradigm. After the success of prompt-based
GPT-3, a series of masked language model (MLM)-based (e.g., BERT, RoBERTa)
prompt learning methods became popular and widely used. However, another
efficient pre-trained discriminative model, ELECTRA, has probably been
neglected. In this paper, we attempt to accomplish several NLP tasks in the
zero-shot scenario using our proposed novel replaced token detection
(RTD)-based prompt learning method. Experimental results show that the ELECTRA
model based on RTD-prompt learning achieves surprisingly strong,
state-of-the-art zero-shot performance. Numerically, compared to
MLM-RoBERTa-large and MLM-BERT-large, our RTD-ELECTRA-large achieves average
improvements of about 8.4% and 13.7%, respectively, across all 15 tasks.
In particular, on the SST-2 task, our
RTD-ELECTRA-large achieves an astonishing 90.1% accuracy without any training
data. Overall, compared to the pre-trained masked language models, the
pre-trained replaced token detection model performs better in zero-shot
learning. The source code is available at:
https://github.com/nishiwen1214/RTD-ELECTRA.
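For intuition, the following is a minimal sketch of how RTD-based prompting can be used for zero-shot classification with a pre-trained ELECTRA discriminator. It assumes the HuggingFace Transformers library and the google/electra-large-discriminator checkpoint; the template ("It was ...") and the label words are illustrative choices, not necessarily the exact verbalizers used in the paper.

```python
# Minimal RTD-prompt zero-shot sketch (illustrative template and label words).
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-large-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name).eval()

def rtd_zero_shot(sentence, label_words):
    """Return the label whose word the discriminator judges least likely to be 'replaced'."""
    scores = {}
    for label, word in label_words.items():
        enc = tokenizer(f"{sentence} It was {word}.", return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits[0]          # per-token "replaced" logits
        ids = enc["input_ids"][0].tolist()
        word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        start = next(i for i in range(len(ids))
                     if ids[i:i + len(word_ids)] == word_ids)
        # Average "replaced" probability over the label word's tokens.
        scores[label] = torch.sigmoid(logits[start:start + len(word_ids)]).mean().item()
    return min(scores, key=scores.get)               # lowest "replaced" prob wins

# Zero-shot sentiment example: no training data is used.
print(rtd_zero_shot("A thoroughly enjoyable film.",
                    {"positive": "great", "negative": "terrible"}))
```

The intuition follows the abstract: the discriminator was pre-trained to flag tokens that do not fit their context, so the label word it considers most "original" marks the most plausible completion of the prompt.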
Related papers
- Tuning Language Models as Training Data Generators for
Augmentation-Enhanced Few-Shot Learning [30.65315081964461]
We study few-shot learning with pretrained language models (PLMs) from a different perspective.
We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large number of novel training samples.
Our approach FewGen achieves an overall better result across seven classification tasks of the GLUE benchmark than existing few-shot learning methods.
arXiv Detail & Related papers (2022-11-06T06:46:47Z) - Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained
Models [43.7024573212373]
We adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks.
Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead.
arXiv Detail & Related papers (2022-05-30T16:32:30Z) - Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than
In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z) - Prompt Consistency for Zero-Shot Task Generalization [118.81196556175797]
In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance.
Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency.
Our approach outperforms the state-of-the-art zero-shot learner, T0, on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy.
arXiv Detail & Related papers (2022-04-29T19:18:37Z) - Improving Neural Machine Translation by Denoising Training [95.96569884410137]
We present a simple and effective pre-training strategy, Denoising Training (DoT), for neural machine translation.
We update the model parameters with source- and target-side denoising tasks at the early stage and then tune the model normally.
Experiments show that DoT consistently improves neural machine translation performance across 12 bilingual and 16 multilingual directions.
arXiv Detail & Related papers (2022-01-19T00:11:38Z) - DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We show that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z) - To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on
Resource Rich Tasks [25.05882459314221]
We show that as the number of training examples grows into the millions, the accuracy gap between fine-tuning a BERT-based model and training a vanilla LSTM from scratch narrows to within 1%.
Our findings indicate that pre-trained models might reach a diminishing return point as the supervised data size increases significantly.
arXiv Detail & Related papers (2020-06-15T18:18:59Z) - MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z) - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)