Instance Regularization for Discriminative Language Model Pre-training
- URL: http://arxiv.org/abs/2210.05471v1
- Date: Tue, 11 Oct 2022 14:16:37 GMT
- Title: Instance Regularization for Discriminative Language Model Pre-training
- Authors: Zhuosheng Zhang, Hai Zhao, Ming Zhou
- Abstract summary: This work proposes to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training.
Experimental results on natural language understanding and reading comprehension benchmarks show that our approach improves pre-training efficiency, effectiveness, and robustness.
- Score: 108.41891836796366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discriminative pre-trained language models (PrLMs) can be generalized as
denoising auto-encoders that work with two procedures, ennoising and denoising.
First, an ennoising process corrupts texts with arbitrary noising functions to
construct training instances. Then, a denoising language model is trained to
restore the corrupted tokens. Existing studies have made progress by optimizing
independent strategies of either ennoising or denoising. They treat training
instances equally throughout the training process, with little attention to the
individual contribution of those instances. To model explicit signals of
instance contribution, this work proposes to estimate the complexity of
restoring the original sentences from corrupted ones in language model
pre-training. The estimations involve the corruption degree in the ennoising
data construction process and the prediction confidence in the denoising
counterpart. Experimental results on natural language understanding and reading
comprehension benchmarks show that our approach improves pre-training
efficiency, effectiveness, and robustness. Code is publicly available at
https://github.com/cooelf/InstanceReg
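
As a rough illustration of the idea in the abstract, the sketch below weights each training instance by an estimated restoration complexity that combines the corruption degree from the ennoising step with the prediction confidence from the denoising model. The function names and the exact weighting formula are assumptions for illustration, not the paper's formulation.

```python
# Hypothetical sketch of instance-level weighting for a denoising LM objective.
# The weighting function is illustrative, not the paper's exact recipe.
import torch
import torch.nn.functional as F


def instance_weighted_mlm_loss(logits, labels, ignore_index=-100):
    """logits: (B, T, V); labels: (B, T) with ignore_index at uncorrupted positions."""
    B, T, V = logits.shape
    mask = labels.ne(ignore_index)                       # corrupted positions
    per_tok = F.cross_entropy(
        logits.reshape(-1, V), labels.reshape(-1),
        ignore_index=ignore_index, reduction="none",
    ).view(B, T)

    # (a) corruption degree: fraction of corrupted tokens per instance.
    corruption = mask.float().mean(dim=1)

    # (b) prediction confidence: mean probability assigned to the gold token
    #     at corrupted positions (detached so the weights receive no gradients).
    with torch.no_grad():
        probs = logits.softmax(dim=-1)
        gold = probs.gather(-1, labels.clamp_min(0).unsqueeze(-1)).squeeze(-1)
        confidence = (gold * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1)

    # Harder instances (more corruption, lower confidence) get larger weights.
    weight = 1.0 + corruption * (1.0 - confidence)
    per_inst = (per_tok * mask).sum(dim=1) / mask.sum(dim=1).clamp_min(1)
    return (weight * per_inst).sum() / weight.sum()
```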
Related papers
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its strong language understanding capability, outperforms various strong pre-trained language models.
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
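
A minimal sketch of the two pre-training objectives named above, replaced token detection and replaced token denoising. The toy encoder and heads below are placeholders, not GanLM's actual encoder-decoder architecture or discriminator design.

```python
# Sketch of replaced token detection (per-token real/replaced classification)
# and replaced token denoising (recovering the original tokens).
import torch.nn as nn
import torch.nn.functional as F


class ToyReplacedTokenObjectives(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=2
        )
        self.detector = nn.Linear(hidden, 1)           # replaced token detection head
        self.denoiser = nn.Linear(hidden, vocab_size)  # replaced token denoising head

    def forward(self, corrupted_ids, original_ids, replaced_mask):
        h = self.encoder(self.embed(corrupted_ids))    # (B, T, hidden)
        # Detection: is each token replaced (1) or original (0)?
        det_loss = F.binary_cross_entropy_with_logits(
            self.detector(h).squeeze(-1), replaced_mask.float()
        )
        # Denoising: recover the original ids at replaced positions only.
        logits = self.denoiser(h)[replaced_mask]
        den_loss = F.cross_entropy(logits, original_ids[replaced_mask])
        return det_loss + den_loss
```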
- Robustification of Multilingual Language Models to Real-world Noise with Robust Contrastive Pretraining [14.087882550564169]
Prior works that assess the robustness of neural models on noisy data and suggest improvements are limited to the English language.
To benchmark the performance of pretrained multilingual models, we construct noisy datasets covering five languages and four NLP tasks.
We propose Robust Contrastive Pretraining (RCP) to boost the zero-shot cross-lingual robustness of multilingual pretrained models.
arXiv Detail & Related papers (2022-10-10T15:40:43Z)
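
A hedged sketch of a contrastive robustness objective in the spirit described above: pull the representation of a sentence and of its noised version together, and push apart the other sentences in the batch (InfoNCE). The encoder and the noising function are placeholders, not RCP's actual recipe.

```python
# Contrastive alignment of clean and noisy views of the same sentence.
import torch
import torch.nn.functional as F


def clean_noisy_contrastive_loss(clean_repr, noisy_repr, temperature=0.05):
    """clean_repr, noisy_repr: (B, H) sentence embeddings of paired views."""
    clean = F.normalize(clean_repr, dim=-1)
    noisy = F.normalize(noisy_repr, dim=-1)
    logits = clean @ noisy.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(clean.size(0), device=clean.device)
    # Row i should match column i (its own noised view).
    return F.cross_entropy(logits, targets)
```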
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
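
An illustrative sketch of the noise stability idea summarized above: add standard Gaussian noise to a hidden representation and penalize how much the layers above it change their outputs. The layer choice, noise scale, and model wiring are assumptions, not the exact LNSR procedure.

```python
# Penalize instability of upper layers under Gaussian perturbation of a hidden state.
import torch


def noise_stability_penalty(upper_layers, hidden, sigma=1.0):
    """upper_layers: the sub-network above the perturbed layer; hidden: (B, T, H)."""
    clean_out = upper_layers(hidden)
    noisy_out = upper_layers(hidden + sigma * torch.randn_like(hidden))
    # L2 distance between clean and noise-perturbed outputs, averaged over the batch.
    return (clean_out - noisy_out).pow(2).mean()


# Usage (hypothetical): total_loss = task_loss + lambda_reg * noise_stability_penalty(upper_layers, h_k)
```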
- How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training? [86.48323488619629]
We analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus.
We find that, while relatively better performance is often observed when languages are sampled more equally, downstream performance is more robust to language imbalance than commonly expected.
arXiv Detail & Related papers (2022-04-29T17:50:36Z)
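
A minimal sketch of the experimental knob studied above: assembling a tokenizer training corpus with controlled per-language data ratios. The file paths and sampling scheme are illustrative; the paper's exact setup may differ.

```python
# Build a mixed-language corpus with chosen per-language ratios for tokenizer training.
import random


def sample_tokenizer_corpus(lang_files, ratios, total_lines=1_000_000, seed=0):
    """lang_files: {lang: path to text file}; ratios: {lang: fraction, summing to 1.0}."""
    rng = random.Random(seed)
    corpus = []
    for lang, path in lang_files.items():
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        k = int(total_lines * ratios[lang])
        corpus.extend(rng.choices(lines, k=k))   # sample with replacement
    rng.shuffle(corpus)
    return corpus  # feed this to a subword tokenizer trainer (e.g., SentencePiece)
```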
- COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z)
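
A hedged sketch of the "correcting" half described above: an auxiliary masked LM proposes replacements at masked positions, and the main model is trained to recover the original token at every position of the corrupted sequence. The "contrasting" half pairs sequence-level representations with a contrastive loss along the lines of the RCP sketch earlier. Interfaces and sampling choices are placeholders, not COCO-LM's exact pipeline.

```python
# Corrupt with an auxiliary masked LM, then train the main model to correct all tokens.
import torch
import torch.nn.functional as F


def corrupt_with_auxiliary(aux_logits, input_ids, mask_positions):
    """Replace masked positions with samples from the auxiliary LM's distribution."""
    sampled = torch.distributions.Categorical(logits=aux_logits).sample()  # (B, T)
    return torch.where(mask_positions, sampled, input_ids)


def corrective_lm_loss(main_logits, original_ids):
    """main_logits: (B, T, V) over the corrupted input; original_ids: (B, T) to recover."""
    V = main_logits.size(-1)
    return F.cross_entropy(main_logits.reshape(-1, V), original_ids.reshape(-1))
```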
- CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations [42.86803751871867]
We present ContrAstive Pre-Training (CAPT) to learn noise invariant sequence representations.
CAPT encourages the consistency between representations of the original sequence and its corrupted version via unsupervised instance-wise training signals.
arXiv Detail & Related papers (2020-10-13T13:08:34Z)
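
A sketch of the consistency objective summarized above, written as a symmetric instance-wise contrastive loss between each sequence and its corrupted version. The encoder and corruption choices are placeholders rather than CAPT's exact design.

```python
# Symmetric InfoNCE between original and corrupted sequence representations.
import torch
import torch.nn.functional as F


def symmetric_consistency_loss(orig_repr, corrupted_repr, temperature=0.1):
    """orig_repr, corrupted_repr: (B, H) representations of paired sequences."""
    a = F.normalize(orig_repr, dim=-1)
    b = F.normalize(corrupted_repr, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    # Each sequence should identify its own corrupted version, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```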
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
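
A loudly hypothetical sketch: one generic, continual-learning-flavored way to preserve pre-trained cross-lingual ability during fine-tuning is to penalize parameter drift from the pre-trained snapshot while optimizing the downstream task. This is a simple stand-in regularizer (an L2-toward-pretrained penalty), not necessarily the method used in the paper above.

```python
# Generic drift penalty toward the pre-trained snapshot; a stand-in, not the paper's method.
import torch


def drift_penalty(model, pretrained_state):
    """Sum of squared parameter drift from the frozen pre-trained snapshot."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        penalty = penalty + (p - pretrained_state[name].to(p.device)).pow(2).sum()
    return penalty


# Usage (hypothetical):
#   snapshot = {k: v.detach().clone() for k, v in model.state_dict().items()}
#   loss = task_loss + lambda_reg * drift_penalty(model, snapshot)
```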
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.