Layer-wise Regularized Dropout for Neural Language Models
- URL: http://arxiv.org/abs/2402.16361v1
- Date: Mon, 26 Feb 2024 07:31:35 GMT
- Title: Layer-wise Regularized Dropout for Neural Language Models
- Authors: Shiwen Ni, Min Yang, Ruifeng Xu, Chengming Li and Xiping Hu
- Abstract summary: Layer-wise Regularized Dropout (LR-Drop) is specially designed for Transformer-based language models.
We show that LR-Drop achieves superior performance, including state-of-the-art results.
- Score: 57.422407462430186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For the pre-trained neural language models that are popular today, dropout
is an indispensable regularization technique. To resolve the inconsistency between
training and inference caused by the randomness of dropout, some studies use
consistency training to regularize dropout at the output layer. In this paper, we
propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed
for Transformer-based language models. Specifically, LR-Drop regularizes each
Transformer layer using a consistency training strategy: each training sample passes
through two siamese sub-models sampled by dropout, and LR-Drop forces the hidden
states, multi-head attention matrices, and output distributions of the two siamese
sub-models to be consistent. LR-Drop can be regarded as a "self-distillation"
framework in which each sub-model generated by dropout serves as the other's
"teacher" and "student". Through extensive experiments on 8 natural language
understanding datasets, 6 neural machine translation datasets, and 1 abstractive
summarization dataset (15 datasets in total), we show that LR-Drop achieves superior
performance, including state-of-the-art results.
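To make the training objective above concrete, here is a minimal PyTorch/Hugging Face sketch of a layer-wise consistency loss in the spirit of LR-Drop. The specific loss forms (MSE on hidden states and attention matrices, bidirectional KL on the output distributions), the weights alpha/beta/gamma, and the choice of bert-base-uncased are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative backbone; LR-Drop is described for Transformer-based LMs in general.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", output_hidden_states=True, output_attentions=True
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.train()  # keep dropout active so the two passes sample different sub-models


def symmetric_kl(p_logits, q_logits):
    """Bidirectional KL divergence between two output distributions."""
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    return 0.5 * (
        F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")
        + F.kl_div(p_log, q_log, log_target=True, reduction="batchmean")
    )


def lr_drop_loss(batch, labels, alpha=1.0, beta=1.0, gamma=1.0):
    # Two stochastic forward passes = two "siamese" dropout sub-models.
    out1 = model(**batch, labels=labels)
    out2 = model(**batch, labels=labels)

    task_loss = 0.5 * (out1.loss + out2.loss)

    # Layer-wise consistency on hidden states and attention matrices
    # (MSE is an illustrative choice of distance).
    hidden_loss = sum(
        F.mse_loss(h1, h2)
        for h1, h2 in zip(out1.hidden_states, out2.hidden_states)
    )
    attn_loss = sum(
        F.mse_loss(a1, a2)
        for a1, a2 in zip(out1.attentions, out2.attentions)
    )

    # Output-distribution consistency at the final layer.
    kl_loss = symmetric_kl(out1.logits, out2.logits)

    return task_loss + alpha * hidden_loss + beta * attn_loss + gamma * kl_loss


batch = tokenizer(["a toy training example"], return_tensors="pt")
loss = lr_drop_loss(batch, labels=torch.tensor([0]))
loss.backward()
```

Because dropout stays active during both forward passes, the two passes on the same batch realize two different sub-models; the consistency terms are what let each sub-model act as the other's teacher and student.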
Related papers
- R-Block: Regularized Block of Dropout for convolutional networks [0.0]
Dropout as a regularization technique is widely used in fully connected layers but is less effective in convolutional layers.
In this paper, we apply a mutual learning training strategy for convolutional layer regularization, namely R-Block.
We show that R-Block achieves better performance than other existing structured dropout variants.
arXiv Detail & Related papers (2023-07-27T18:53:14Z) - Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net
Estimation and Optimization [58.90989478049686]
Bi-Drop is a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets.
Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods.
arXiv Detail & Related papers (2023-05-24T06:09:26Z) - Bridging the Data Gap between Training and Inference for Unsupervised
Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo parallel data with translated source sentences, while natural source sentences are used at inference.
The source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo parallel data (natural source, translated target) to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z) - Dropout can Simulate Exponential Number of Models for Sample Selection
Techniques [0.0]
We show how we can modify two model-based sample selection methodologies to use an exponential number of shared models.
Not only is it more convenient to use a single model with Dropout, but this approach also combines the natural benefits of Dropout with those of training an exponential number of models.
arXiv Detail & Related papers (2022-02-26T17:53:26Z) - R-Drop: Regularized Dropout for Neural Networks [99.42791938544012]
Dropout is a powerful and widely used technique to regularize the training of deep neural networks.
We introduce a simple regularization strategy built upon dropout in model training, namely R-Drop, which forces the output distributions of different sub-models to be consistent with each other (a minimal sketch of this output-level loss follows after this list).
arXiv Detail & Related papers (2021-06-28T08:01:26Z) - Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning (SDCLR) to automatically balance representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z) - UniDrop: A Simple yet Effective Technique to Improve Transformer without
Extra Cost [110.67392881417777]
The Transformer architecture achieves great success in numerous natural language processing tasks.
We find that simple techniques such as dropout can greatly boost model performance with careful design.
Specifically, we propose an approach named UniDrop that unites three different dropout techniques.
arXiv Detail & Related papers (2021-04-11T07:43:19Z) - Tight Integrated End-to-End Training for Cascaded Speech Translation [40.76367623739673]
A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
arXiv Detail & Related papers (2020-11-24T15:43:49Z)
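For contrast with the layer-wise variant sketched above, here is a minimal sketch of the R-Drop-style output-level consistency loss referenced in the R-Drop entry. The toy classifier and the weight alpha are illustrative assumptions; only the structure (two dropout passes, cross-entropy plus bidirectional KL on the outputs) follows the description in that entry.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy classifier with dropout; any dropout-regularized network works the same way.
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 4)
)
model.train()  # dropout active


def r_drop_loss(x, y, alpha=4.0):
    # Two forward passes on the same batch sample two different sub-models.
    logits1, logits2 = model(x), model(x)
    ce = 0.5 * (F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y))
    p_log = F.log_softmax(logits1, dim=-1)
    q_log = F.log_softmax(logits2, dim=-1)
    # Bidirectional KL between the two output distributions.
    kl = 0.5 * (
        F.kl_div(p_log, q_log, log_target=True, reduction="batchmean")
        + F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")
    )
    return ce + alpha * kl


x, y = torch.randn(8, 32), torch.randint(0, 4, (8,))
r_drop_loss(x, y).backward()
```

Compared with the LR-Drop sketch, this loss constrains only the final output distributions rather than every layer's hidden states and attention matrices.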