R-Drop: Regularized Dropout for Neural Networks
- URL: http://arxiv.org/abs/2106.14448v1
- Date: Mon, 28 Jun 2021 08:01:26 GMT
- Title: R-Drop: Regularized Dropout for Neural Networks
- Authors: Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei
Chen, Min Zhang, Tie-Yan Liu
- Abstract summary: Dropout is a powerful and widely used technique to regularize the training of deep neural networks.
We introduce a simple regularization strategy upon dropout in model training, namely R-Drop, which forces the output distributions of different sub models to be consistent with each other.
- Score: 99.42791938544012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dropout is a powerful and widely used technique to regularize the training of
deep neural networks. In this paper, we introduce a simple regularization
strategy upon dropout in model training, namely R-Drop, which forces the output
distributions of different sub models generated by dropout to be consistent
with each other. Specifically, for each training sample, R-Drop minimizes the
bidirectional KL-divergence between the output distributions of two sub models
sampled by dropout. Theoretical analysis reveals that R-Drop reduces the
freedom of the model parameters and complements dropout. Experiments on
5 widely used deep learning tasks (18 datasets in total),
including neural machine translation, abstractive summarization, language
understanding, language modeling, and image classification, show that R-Drop is
universally effective. In particular, it yields substantial improvements when
applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large,
and BART, and achieves state-of-the-art (SOTA) performance with the vanilla
Transformer model on WMT14 English→German translation (30.91 BLEU)
and WMT14 English→French translation (43.95 BLEU), even surpassing
models trained with extra large-scale data and expert-designed advanced
variants of Transformer models. Our code is available on
GitHub: https://github.com/dropreg/R-Drop.
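To make the objective in the abstract concrete, below is a minimal sketch of an R-Drop training loss in PyTorch (the official repository is PyTorch-based). The function name `r_drop_loss`, the classification setup (a model returning logits and a cross-entropy task loss), and the weighting coefficient `alpha` are illustrative assumptions, not values taken from the paper or its repository.

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """Sketch of the R-Drop objective: two stochastic forward passes through
    the same model (dropout active), a task loss on each pass, and a symmetric
    KL term that pulls the two output distributions together.
    `alpha` is an illustrative weight; it is tuned per task in practice."""
    logits1 = model(x)  # first dropout sub-model (random mask 1)
    logits2 = model(x)  # second dropout sub-model (random mask 2)

    # Cross-entropy of both passes against the labels
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)

    # Bidirectional KL divergence between the two predictive distributions
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp1, logp2, reduction="batchmean", log_target=True)
        + F.kl_div(logp2, logp1, reduction="batchmean", log_target=True)
    )
    return ce + alpha * kl
```

In practice the two passes are often merged by repeating each example twice within one batch, so a single forward call suffices; the loss above is unchanged apart from how the logits are split.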
Related papers
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Layer-wise Regularized Dropout for Neural Language Models [57.422407462430186]
Layer-wise Regularized Dropout (LR-Drop) is specially designed for Transformer-based language models.
We show that LR-Drop achieves superior performance, including state-of-the-art results.
arXiv Detail & Related papers (2024-02-26T07:31:35Z)
- BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling [0.0]
Common Crawl may contain enough noise to make pre-training on it sub-optimal.
We present a novel data-centric technique that enables the pre-training of language models in roughly half the number of steps.
Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.
arXiv Detail & Related papers (2022-07-14T10:48:42Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in the real world is commonly imbalanced and follows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning (SDCLR) to automatically balance representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost [110.67392881417777]
The Transformer architecture has achieved great success across a wide range of natural language processing tasks.
We find that simple techniques such as dropout can greatly boost model performance with careful design.
Specifically, we propose an approach named UniDrop that unites three different dropout techniques.
arXiv Detail & Related papers (2021-04-11T07:43:19Z)
- Learning Light-Weight Translation Models from Deep Transformer [25.386460662408773]
We propose a novel group-permutation based knowledge distillation approach to compressing the deep Transformer model into a shallow model.
Our compressed model is 8X shallower than the deep model, with almost no loss in BLEU.
To further enhance the teacher model, we present a Skipping Sub-Layer method to randomly omit sub-layers to introduce perturbation into training.
arXiv Detail & Related papers (2020-12-27T05:33:21Z)
- Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization [62.8384110757689]
Overfitting is ubiquitous in real-world applications of deep neural networks (DNNs).
The advanced dropout technique applies a model-free, easily implemented distribution with a parametric prior and adaptively adjusts the dropout rate.
We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets.
arXiv Detail & Related papers (2020-10-11T13:19:58Z)
- Machine Learning's Dropout Training is Distributionally Robust Optimal [10.937094979510212]
This paper shows that dropout training in Generalized Linear Models provides out-of-sample expected loss guarantees.
It also provides a novel, parallelizable, Unbiased Multi-Level Monte Carlo algorithm to speed up the implementation of dropout training.
arXiv Detail & Related papers (2020-09-13T23:13:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.