UniDrop: A Simple yet Effective Technique to Improve Transformer without
Extra Cost
- URL: http://arxiv.org/abs/2104.04946v1
- Date: Sun, 11 Apr 2021 07:43:19 GMT
- Title: UniDrop: A Simple yet Effective Technique to Improve Transformer without
Extra Cost
- Authors: Zhen Wu, Lijun Wu, Qi Meng, Yingce Xia, Shufang Xie, Tao Qin, Xinyu
Dai and Tie-Yan Liu
- Abstract summary: The Transformer architecture achieves great success in a wide range of natural language processing tasks.
We find that simple techniques such as dropout can greatly boost model performance when carefully designed.
Specifically, we propose an approach named UniDrop that unites three different dropout techniques.
- Score: 110.67392881417777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture has achieved great success across a wide
range of natural language processing tasks. The over-parameterization of
Transformer models has motivated many works on alleviating overfitting to
obtain better performance. Through our explorations, we find that simple
techniques such as dropout can greatly boost model performance when carefully
designed. Therefore, in this paper, we integrate different dropout techniques
into the training of Transformer models. Specifically, we propose an approach
named UniDrop that unites three dropout techniques, from fine-grained to
coarse-grained: feature dropout, structure dropout, and data dropout.
Theoretically, we demonstrate that these three dropouts play different roles
from a regularization perspective. Empirically, we conduct experiments on both
neural machine translation and text classification benchmark datasets.
Extensive results indicate that a Transformer with UniDrop achieves around a
1.5 BLEU improvement on the IWSLT14 translation tasks, and better
classification accuracy even when the strong pre-trained RoBERTa model is used
as the backbone.
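The abstract names three dropout granularities: feature dropout, structure dropout, and data dropout. As a rough illustration of how they could be combined in one encoder, the sketch below applies element-wise dropout to layer outputs, randomly skips whole layers, and randomly zeroes input token embeddings. The class name, dropout rates, and the batch-first (batch, sequence, hidden) layout are assumptions for this example, not the authors' implementation.

```python
import torch
import torch.nn as nn


class UniDropStyleEncoder(nn.Module):
    """Illustrative sketch only: combines feature, structure, and data dropout."""

    def __init__(self, layers, p_feature=0.1, p_layer=0.2, p_token=0.1):
        super().__init__()
        self.layers = nn.ModuleList(layers)           # e.g. nn.TransformerEncoderLayer(batch_first=True)
        self.feature_dropout = nn.Dropout(p_feature)  # fine-grained: drop individual features
        self.p_layer = p_layer                        # coarse-grained: skip whole layers
        self.p_token = p_token                        # data-level: drop input tokens

    def forward(self, x, pad_mask=None):              # x: (batch, seq_len, d_model) embeddings
        if self.training and self.p_token > 0:
            # data dropout: zero out randomly chosen token embeddings
            keep = (torch.rand(x.shape[:2], device=x.device) > self.p_token).unsqueeze(-1)
            x = x * keep
        for layer in self.layers:
            # structure dropout: skip this entire layer with probability p_layer
            if self.training and torch.rand(()).item() < self.p_layer:
                continue
            x = layer(x, src_key_padding_mask=pad_mask)
            # feature dropout: element-wise dropout on the layer output
            x = self.feature_dropout(x)
        return x


# usage sketch
layers = [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(6)]
encoder = UniDropStyleEncoder(layers)
out = encoder(torch.randn(2, 16, 512))
```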
Related papers
- Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis [16.253898272659242]
This study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to their feedforward networks (FFNs).
Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., a 2.6× FFN speed-up with 32% of the parameters) and effective during training; a minimal sketch of such a low-rank FFN appears after this list.
Motivated by this finding, we develop wide and structured networks that surpass current medium-sized and large Transformers in perplexity and throughput.
arXiv Detail & Related papers (2024-07-13T10:08:55Z)
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms based on low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- Layer-wise Regularized Dropout for Neural Language Models [57.422407462430186]
Layer-wise Regularized Dropout (LR-Drop) is specially designed for Transformer-based language models.
We show that LR-Drop achieves superior performance, including state-of-the-art results.
arXiv Detail & Related papers (2024-02-26T07:31:35Z)
- R-Drop: Regularized Dropout for Neural Networks [99.42791938544012]
Dropout is a powerful and widely used technique to regularize the training of deep neural networks.
We introduce a simple regularization strategy built upon dropout in model training, namely R-Drop, which forces the output distributions of the different sub-models sampled by dropout to be consistent with each other; a minimal sketch of this objective appears after this list.
arXiv Detail & Related papers (2021-06-28T08:01:26Z)
- DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that, at the cost of a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z)
- Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization [62.8384110757689]
Overfitting ubiquitously exists in real-world applications of deep neural networks (DNNs).
The advanced dropout technique applies a model-free and easily implemented distribution with a parametric prior, and adaptively adjusts the dropout rate.
We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets.
arXiv Detail & Related papers (2020-10-11T13:19:58Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy; a generic cascade-of-rankers sketch appears after this list.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
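For the low-rank FFN parametrization discussed in "Investigating Low-Rank Training in Transformer Language Models" above, the following is a minimal sketch of the generic factorization idea under assumed dimensions: each dense feed-forward projection is replaced by a pair of thin rank-r linear maps. The sizes and rank are illustrative, not the paper's configuration.

```python
import torch.nn as nn


class LowRankFFN(nn.Module):
    """Generic sketch: a Transformer feed-forward block with rank-r factorized projections."""

    def __init__(self, d_model=768, d_ff=3072, rank=128):
        super().__init__()
        # W_in (d_model x d_ff) is replaced by two thin factors of rank `rank`; likewise W_out.
        self.w_in = nn.Sequential(nn.Linear(d_model, rank, bias=False), nn.Linear(rank, d_ff))
        self.w_out = nn.Sequential(nn.Linear(d_ff, rank, bias=False), nn.Linear(rank, d_model))
        self.act = nn.GELU()

    def forward(self, x):
        return self.w_out(self.act(self.w_in(x)))
```

With these illustrative sizes, each factorized projection uses roughly rank × (d_model + d_ff) parameters instead of d_model × d_ff, which is where the parameter and speed savings quoted in the entry come from.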
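For the R-Drop entry above, here is a minimal sketch of the described objective: two stochastic forward passes through the same dropout-enabled model sample two sub-models, and a symmetric KL term pulls their output distributions together on top of the usual task loss. The classification setting and the weight alpha are assumptions for the example.

```python
import torch.nn.functional as F


def r_drop_style_loss(model, x, labels, alpha=1.0):
    """Two dropout-perturbed forward passes plus a symmetric KL consistency term."""
    logits1 = model(x)  # dropout is active in train mode, so this samples one sub-model
    logits2 = model(x)  # a second, independently sampled sub-model
    task_loss = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    consistency = 0.5 * (
        F.kl_div(logp1, logp2, reduction="batchmean", log_target=True)
        + F.kl_div(logp2, logp1, reduction="batchmean", log_target=True)
    )
    return task_loss + alpha * consistency
```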
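For the Cascade Transformer entry above, the summary only says the model becomes a cascade of rankers that saves computation with little accuracy loss. The sketch below illustrates the general cascade-of-rankers pattern (cheap scorers prune candidates before more expensive scorers run); the scorer list, keep ratio, and pruning rule are invented for the example and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_rank(candidates: Sequence[str],
                 scorers: List[Callable[[str], float]],
                 keep_ratio: float = 0.5) -> List[Tuple[str, float]]:
    """Generic ranker cascade: each stage scores the surviving candidates and keeps
    only the top fraction, so later (more expensive) stages see fewer items."""
    survivors = list(candidates)
    ranking: List[Tuple[str, float]] = []
    for stage, scorer in enumerate(scorers):
        ranking = sorted(((c, scorer(c)) for c in survivors), key=lambda cs: cs[1], reverse=True)
        if stage < len(scorers) - 1:                    # prune between stages, not after the last one
            keep = max(1, int(len(ranking) * keep_ratio))
            survivors = [c for c, _ in ranking[:keep]]
    return ranking                                      # final ranking produced by the last stage
```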