Related papers: Scheduled DropHead: A Regularization Method for Transformer Models

Scheduled DropHead: A Regularization Method for Transformer Models

URL: http://arxiv.org/abs/2004.13342v2
Date: Sun, 1 Nov 2020 15:57:37 GMT
Title: Scheduled DropHead: A Regularization Method for Transformer Models
Authors: Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, Ming Zhou
Abstract summary: DropHead is a structured dropout method specifically designed for regularizing the multi-head attention mechanism. It drops entire attention-heads during training. It prevents the multi-head attention model from being dominated by a small portion of attention heads.
Score: 111.18614166615968
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we introduce DropHead, a structured dropout method specifically designed for regularizing the multi-head attention mechanism, which is a key component of transformer, a state-of-the-art model for various NLP tasks. In contrast to the conventional dropout mechanisms which randomly drop units or connections, the proposed DropHead is a structured dropout method. It drops entire attention-heads during training and It prevents the multi-head attention model from being dominated by a small portion of attention heads while also reduces the risk of overfitting the training data, thus making use of the multi-head attention mechanism more efficiently. Motivated by recent studies about the learning dynamic of the multi-head attention mechanism, we propose a specific dropout rate schedule to adaptively adjust the dropout rate of DropHead and achieve better regularization effect. Experimental results on both machine translation and text classification benchmark datasets demonstrate the effectiveness of the proposed approach.

Related papers

MiMu: Mitigating Multiple Shortcut Learning Behavior of Transformers [19.27328009299697]
Empirical Risk Minimization (ERM) models often rely on spurious correlations between features and labels during the learning process. We propose MiMu, a novel method integrated with Transformer-based generalizations designed to Mitigate Multiple shortcut learning behavior.
arXiv Detail & Related papers (2025-04-14T08:11:09Z)
Adjusting Pretrained Backbones for Performativity [34.390793811659556]
We propose a novel technique to adjust pretrained backbones for performativity in a modular way. We show how it leads to smaller loss along the retraining trajectory and enables us to effectively select among candidate models to anticipate performance degradations.
arXiv Detail & Related papers (2024-10-06T14:41:13Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR. For the first time, we revisit the CLIP and CoOp with our method to effectively improve the model on few-shot image classficiation scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection [66.42229859018775]
We introduce a unified, high-capacity weakly supervised object detection (WSOD) network called HUWSOD. HUWSOD incorporates a self-supervised proposal generator and an autoencoder proposal generator with a multi-rate re-supervised pyramid to replace traditional object proposals. Our findings indicate that randomly boxes, although significantly different from well-designed offline object proposals, are effective for WSOD training.
arXiv Detail & Related papers (2024-06-27T17:59:49Z)
Perceiver-based CDF Modeling for Time Series Forecasting [25.26713741799865]
We propose a new architecture, called perceiver-CDF, for modeling cumulative distribution functions (CDF) of time series data. Our approach combines the perceiver architecture with a copula-based attention mechanism tailored for multimodal time series prediction. Experiments on the unimodal and multimodal benchmarks consistently demonstrate a 20% improvement over state-of-the-art methods.
arXiv Detail & Related papers (2023-10-03T01:13:17Z)
Debiased Fine-Tuning for Vision-language Models by Prompt Regularization [50.41984119504716]
We present a new paradigm for fine-tuning large-scale vision pre-trained models on downstream task, dubbed Prompt Regularization (ProReg) ProReg uses the prediction by prompting the pretrained model to regularize the fine-tuning. We show the consistently strong performance of ProReg compared with conventional fine-tuning, zero-shot prompt, prompt tuning, and other state-of-the-art methods.
arXiv Detail & Related papers (2023-01-29T11:53:55Z)
Semi-Supervised Temporal Action Detection with Proposal-Free Masking [134.26292288193298]
We propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT) SPOT outperforms state-of-the-art alternatives, often by a large margin.
arXiv Detail & Related papers (2022-07-14T16:58:47Z)
Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head. By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions. Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z)
PLACE dropout: A Progressive Layer-wise and Channel-wise Dropout for Domain Generalization [29.824723021053565]
Domain generalization (DG) aims to learn a generic model from multiple observed source domains. The major challenge in DG is that the model inevitably faces a severe overfitting issue due to the domain gap between source and target domains. We develop a novel layer-wise and channel-wise dropout for DG, which randomly selects one layer and then randomly selects its channels to conduct dropout.
arXiv Detail & Related papers (2021-12-07T13:23:52Z)
Causally-motivated Shortcut Removal Using Auxiliary Labels [63.686580185674195]
Key challenge to learning such risk-invariant predictors is shortcut learning. We propose a flexible, causally-motivated approach to address this challenge. We show both theoretically and empirically that this causally-motivated regularization scheme yields robust predictors.
arXiv Detail & Related papers (2021-05-13T16:58:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.