Scheduled DropHead: A Regularization Method for Transformer Models
- URL: http://arxiv.org/abs/2004.13342v2
- Date: Sun, 1 Nov 2020 15:57:37 GMT
- Title: Scheduled DropHead: A Regularization Method for Transformer Models
- Authors: Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, Ming Zhou
- Abstract summary: DropHead is a structured dropout method specifically designed for regularizing the multi-head attention mechanism.
It drops entire attention-heads during training.
It prevents the multi-head attention model from being dominated by a small portion of attention heads.
- Score: 111.18614166615968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce DropHead, a structured dropout method
specifically designed for regularizing the multi-head attention mechanism,
which is a key component of the transformer, a state-of-the-art model for various
NLP tasks. In contrast to conventional dropout mechanisms, which randomly
drop units or connections, the proposed DropHead is a structured dropout
method. It drops entire attention heads during training, which prevents the
multi-head attention model from being dominated by a small portion of attention
heads and also reduces the risk of overfitting the training data, thus making
more efficient use of the multi-head attention mechanism. Motivated by recent
studies of the learning dynamics of the multi-head attention mechanism, we
propose a specific dropout rate schedule to adaptively adjust the dropout rate
of DropHead and achieve a better regularization effect. Experimental results on
both machine translation and text classification benchmark datasets demonstrate
the effectiveness of the proposed approach.
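The two ideas in the abstract, dropping whole attention heads and varying the drop rate over training, can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The abstract does not state how outputs are rescaled or what shape the schedule takes, so the rescaling by num_heads / num_kept, the rule of always keeping at least one head, and the V-shaped piecewise-linear schedule are all assumptions of this sketch, as are the function and parameter names.

```python
import random


def drop_head(head_outputs, p, rng, training=True):
    """Zero out entire attention heads with probability p, then rescale.

    head_outputs: list of per-head output vectors (lists of floats).
    Assumption: surviving heads are scaled by num_heads / num_kept so the
    summed magnitude stays comparable between training and inference.
    """
    if not training or p <= 0.0:
        return [list(h) for h in head_outputs]
    num_heads = len(head_outputs)
    # One Bernoulli draw per head, not per unit: the whole head is kept or dropped.
    keep = [rng.random() >= p for _ in range(num_heads)]
    if not any(keep):
        # Assumption: keep at least one head so the output is never all zeros.
        keep[rng.randrange(num_heads)] = True
    scale = num_heads / sum(keep)
    return [
        [x * scale for x in h] if k else [0.0] * len(h)
        for h, k in zip(head_outputs, keep)
    ]


def scheduled_rate(step, total_steps, turn_step, p_start, p_end):
    """Hypothetical V-shaped schedule: p_start -> 0 at turn_step -> p_end."""
    if step < turn_step:
        return p_start * (1.0 - step / turn_step)
    rest = max(1, total_steps - turn_step)
    return p_end * min(1.0, (step - turn_step) / rest)


if __name__ == "__main__":
    rng = random.Random(0)
    heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    for step in (0, 5, 10, 55, 100):
        p = scheduled_rate(step, 100, 10, p_start=0.2, p_end=0.1)
        print(step, p, drop_head(heads, p, rng))
```

In a real model the per-head vectors would be tensors and the mask would be drawn per layer and per batch; the list-based version here only makes the head-level granularity of the mask explicit.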
Related papers
- HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection [66.42229859018775]
We introduce a unified, high-capacity weakly supervised object detection (WSOD) network called HUWSOD.
HUWSOD incorporates a self-supervised proposal generator and an autoencoder proposal generator with a multi-rate re-supervised pyramid to replace traditional object proposals.
Our findings indicate that random boxes, although significantly different from well-designed offline object proposals, are effective for WSOD training.
arXiv Detail & Related papers (2024-06-27T17:59:49Z) - Mitigating Biases with Diverse Ensembles and Diffusion Models [99.6100669122048]
We propose an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs).
We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features.
We show that DPM-guided diversification is sufficient to remove dependence on primary shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z) - Perceiver-based CDF Modeling for Time Series Forecasting [25.26713741799865]
We propose a new architecture, called perceiver-CDF, for modeling cumulative distribution functions (CDF) of time series data.
Our approach combines the perceiver architecture with a copula-based attention mechanism tailored for multimodal time series prediction.
Experiments on the unimodal and multimodal benchmarks consistently demonstrate a 20% improvement over state-of-the-art methods.
arXiv Detail & Related papers (2023-10-03T01:13:17Z) - Stabilizing and Improving Federated Learning with Non-IID Data and Client Dropout [15.569507252445144]
Data heterogeneity induced by label distribution skew has been shown to be a significant obstacle that limits model performance in federated learning.
We propose a simple yet effective framework by introducing a prior-calibrated softmax function for computing the cross-entropy loss.
The improved model performance over existing baselines in the presence of non-IID data and client dropout is demonstrated.
arXiv Detail & Related papers (2023-03-11T05:17:59Z) - Debiased Fine-Tuning for Vision-language Models by Prompt Regularization [50.41984119504716]
We present a new paradigm for fine-tuning large-scale vision pre-trained models on downstream tasks, dubbed Prompt Regularization (ProReg).
ProReg uses the prediction by prompting the pretrained model to regularize the fine-tuning.
We show the consistently strong performance of ProReg compared with conventional fine-tuning, zero-shot prompt, prompt tuning, and other state-of-the-art methods.
arXiv Detail & Related papers (2023-01-29T11:53:55Z) - AD-DROP: Attribution-Driven Dropout for Robust Language Model
Fine-Tuning [24.028662731799127]
We find that dropping attention positions with low attribution scores can accelerate training and increase the risk of overfitting.
We develop a cross-tuning strategy to alternate fine-tuning and AD-DROP to avoid dropping high-attribution positions excessively.
arXiv Detail & Related papers (2022-10-12T02:54:41Z) - Semi-Supervised Temporal Action Detection with Proposal-Free Masking [134.26292288193298]
We propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT).
SPOT outperforms state-of-the-art alternatives, often by a large margin.
arXiv Detail & Related papers (2022-07-14T16:58:47Z) - Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z) - PLACE dropout: A Progressive Layer-wise and Channel-wise Dropout for Domain Generalization [29.824723021053565]
Domain generalization (DG) aims to learn a generic model from multiple observed source domains.
The major challenge in DG is that the model inevitably faces a severe overfitting issue due to the domain gap between source and target domains.
We develop a novel layer-wise and channel-wise dropout for DG, which randomly selects one layer and then randomly selects its channels to conduct dropout.
arXiv Detail & Related papers (2021-12-07T13:23:52Z) - Causally-motivated Shortcut Removal Using Auxiliary Labels [63.686580185674195]
A key challenge to learning such risk-invariant predictors is shortcut learning.
We propose a flexible, causally-motivated approach to address this challenge.
We show both theoretically and empirically that this causally-motivated regularization scheme yields robust predictors.
arXiv Detail & Related papers (2021-05-13T16:58:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.