R-Drop: Regularized Dropout for Neural Networks
- URL: http://arxiv.org/abs/2106.14448v1
- Date: Mon, 28 Jun 2021 08:01:26 GMT
- Title: R-Drop: Regularized Dropout for Neural Networks
- Authors: Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei
Chen, Min Zhang, Tie-Yan Liu
- Abstract summary: Dropout is a powerful and widely used technique to regularize the training of deep neural networks.
We introduce a simple regularization strategy upon dropout in model training, namely R-Drop, which forces the output distributions of different sub models to be consistent with each other.
- Score: 99.42791938544012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dropout is a powerful and widely used technique to regularize the training of
deep neural networks. In this paper, we introduce a simple regularization
strategy upon dropout in model training, namely R-Drop, which forces the output
distributions of different sub models generated by dropout to be consistent
with each other. Specifically, for each training sample, R-Drop minimizes the
bidirectional KL-divergence between the output distributions of two sub models
sampled by dropout. Theoretical analysis reveals that R-Drop reduces the
freedom of the model parameters and complements dropout. Experiments on
5 widely used deep learning tasks (18 datasets in total),
including neural machine translation, abstractive summarization, language
understanding, language modeling, and image classification, show that R-Drop is
universally effective. In particular, it yields substantial improvements when
applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large,
and BART, and achieves state-of-the-art (SOTA) performance with the vanilla
Transformer model on WMT14 English→German translation (30.91 BLEU)
and WMT14 English→French translation (43.95 BLEU), even surpassing
models trained with extra large-scale data and expert-designed advanced
variants of Transformer models. Our code is available on
GitHub: https://github.com/dropreg/R-Drop.
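To make the objective in the abstract concrete, below is a minimal sketch of an R-Drop training loss in PyTorch (the official repository is PyTorch-based). The function name `r_drop_loss`, the classification setup (a model returning logits and a cross-entropy task loss), and the weighting coefficient `alpha` are illustrative assumptions, not values taken from the paper or its repository.

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """Sketch of the R-Drop objective: two stochastic forward passes through
    the same model (dropout active), a task loss on each pass, and a symmetric
    KL term that pulls the two output distributions together.
    `alpha` is an illustrative weight; it is tuned per task in practice."""
    logits1 = model(x)  # first dropout sub-model (random mask 1)
    logits2 = model(x)  # second dropout sub-model (random mask 2)

    # Cross-entropy of both passes against the labels
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Bidirectional KL divergence between the two predictive distributions
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp1, logp2, reduction="batchmean", log_target=True)
        + F.kl_div(logp2, logp1, reduction="batchmean", log_target=True)
    )
    return ce + alpha * kl
```

In practice the two passes are often merged by repeating each example twice within one batch, so a single forward call suffices; the loss above is unchanged apart from how the logits are split.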
Related papers
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Layer-wise Regularized Dropout for Neural Language Models [57.422407462430186]
Layer-wise Regularized Dropout (LR-Drop) is specially designed for Transformer-based language models.
We show that LR-Drop achieves superior performance, including state-of-the-art results.
arXiv Detail & Related papers (2024-02-26T07:31:35Z)
- BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling [0.0]
Common Crawl may contain enough noise to make pre-training on it sub-optimal.
We present a novel data-centric technique that enables the pre-training of language models in roughly half the number of steps.
Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.
arXiv Detail & Related papers (2022-07-14T10:48:42Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in the real world is commonly imbalanced and follows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning (SDCLR) to automatically balance representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost [110.67392881417777]
The Transformer architecture has achieved great success across a wide range of natural language processing tasks.
We find that simple techniques such as dropout can greatly boost model performance with careful design.
Specifically, we propose an approach named UniDrop that unites three different dropout techniques.
arXiv Detail & Related papers (2021-04-11T07:43:19Z)
- Learning Light-Weight Translation Models from Deep Transformer [25.386460662408773]
We propose a novel group-permutation based knowledge distillation approach to compressing the deep Transformer model into a shallow model.
Our compressed model is 8X shallower than the deep model, with almost no loss in BLEU.
To further enhance the teacher model, we present a Skipping Sub-Layer method to randomly omit sub-layers to introduce perturbation into training.
arXiv Detail & Related papers (2020-12-27T05:33:21Z)
- Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization [62.8384110757689]
Overfitting is ubiquitous in real-world applications of deep neural networks (DNNs).
The advanced dropout technique applies a model-free, easily implemented distribution with a parametric prior and adaptively adjusts the dropout rate.
We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets.
arXiv Detail & Related papers (2020-10-11T13:19:58Z)
- Machine Learning's Dropout Training is Distributionally Robust Optimal [10.937094979510212]
This paper shows that dropout training in Generalized Linear Models provides out-of-sample expected loss guarantees.
It also provides a novel, parallelizable, Unbiased Multi-Level Monte Carlo algorithm to speed up the implementation of dropout training.
arXiv Detail & Related papers (2020-09-13T23:13:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.