DropKey
- URL: http://arxiv.org/abs/2208.02646v4
- Date: Tue, 11 Apr 2023 07:35:33 GMT
- Title: DropKey
- Authors: Bonan Li and Yinhan Hu and Xuecheng Nie and Congying Han and Xiangjian Jiang and Tiande Guo and Luoqi Liu
- Abstract summary: We focus on analyzing and improving the dropout technique for self-attention layers of Vision Transformer.
We propose to move the dropout operation ahead of the attention matrix calculation and set the Key as the dropout unit.
We experimentally validate that the proposed schedule avoids overfitting to low-level features and the loss of high-level semantics.
- Score: 9.846606347586906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on analyzing and improving the dropout technique for self-attention layers of Vision Transformer, which is important yet surprisingly ignored by prior works. In particular, we investigate three core questions: First, what to drop in self-attention layers? Different from dropping attention weights as in the literature, we propose to move the dropout operation ahead of the attention matrix calculation and set the Key as the dropout unit, yielding a novel dropout-before-softmax scheme. We theoretically verify that this scheme preserves both the regularization effect and the probabilistic nature of the attention weights, alleviating overfitting to specific patterns and encouraging the model to globally capture vital information; Second, how to schedule the drop ratio in consecutive layers? In contrast to using a constant drop ratio for all layers, we present a new decreasing schedule that gradually lowers the drop ratio along the stack of self-attention layers. We experimentally validate that this schedule avoids overfitting to low-level features and the loss of high-level semantics, thus improving the robustness and stability of model training; Third, is a structured dropout operation, as used for CNNs, necessary? We experiment with a patch-based, block-wise version of the dropout operation and find that this useful trick for CNNs is not essential for ViT. Based on the exploration of these three questions, we present the novel DropKey method, which regards the Key as the drop unit and exploits a decreasing schedule for the drop ratio, improving ViTs in a general way. Comprehensive experiments demonstrate the effectiveness of DropKey for various ViT architectures, e.g., T2T and VOLO, as well as for various vision tasks, e.g., image classification, object detection, human-object interaction detection and human body shape recovery.
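The following minimal PyTorch sketch illustrates the two ideas described in the abstract: dropout applied to the attention logits before the softmax (so the surviving weights are renormalized into a valid probability distribution), and a linearly decreasing drop ratio across layers. The class and function names (DropKeyAttention, decreasing_drop_schedule) and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DropKeyAttention(nn.Module):
    """Single-head self-attention with DropKey-style masking (illustrative sketch)."""

    def __init__(self, dim: int, drop_ratio: float = 0.1):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.drop_ratio = drop_ratio  # probability of masking a Key entry

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale  # (batch, tokens, tokens)
        if self.training and self.drop_ratio > 0:
            # Dropout *before* softmax: masked entries receive a large negative
            # value, so the softmax renormalizes over the surviving Keys and the
            # attention weights remain a valid probability distribution.
            drop_mask = torch.bernoulli(torch.full_like(logits, self.drop_ratio))
            logits = logits + drop_mask * -1e9
        attn = logits.softmax(dim=-1)
        return self.proj(attn @ v)


def decreasing_drop_schedule(base_ratio: float, num_layers: int) -> list:
    """Linearly decay the drop ratio from base_ratio (first layer) to 0 (last layer)."""
    if num_layers == 1:
        return [base_ratio]
    return [base_ratio * (1.0 - i / (num_layers - 1)) for i in range(num_layers)]


# Usage: earlier layers are regularized more strongly than later ones.
layers = nn.ModuleList(
    [DropKeyAttention(dim=64, drop_ratio=r)
     for r in decreasing_drop_schedule(base_ratio=0.3, num_layers=4)]
)
x = torch.randn(2, 16, 64)
for layer in layers:
    x = layer(x)
```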
Related papers
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
- DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions [63.61970125369834]
We present DropPos, a novel pretext task designed to reconstruct Dropped Positions.
The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
arXiv Detail & Related papers (2023-09-07T09:12:02Z)
- Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z)
- Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards.
arXiv Detail & Related papers (2023-03-02T18:59:15Z)
- Implicit and Efficient Point Cloud Completion for 3D Single Object Tracking [9.372859423951349]
We introduce two novel modules, i.e., Adaptive Refine Prediction (ARP) and Target Knowledge Transfer (TKT).
Our model achieves state-of-the-art performance while maintaining a lower computational consumption.
arXiv Detail & Related papers (2022-09-01T15:11:06Z)
- Adaptive Online Incremental Learning for Evolving Data Streams [4.3386084277869505]
The first major difficulty is concept drift, that is, the probability distribution of the streaming data changes as the data arrives.
The second major difficulty is catastrophic forgetting, that is, forgetting previously learned knowledge when learning something new.
Our research builds on this observation and attempts to overcome these difficulties.
arXiv Detail & Related papers (2022-01-05T14:25:53Z)
- Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization [62.8384110757689]
Overfitting ubiquitously exists in real-world applications of deep neural networks (DNNs).
The advanced dropout technique applies a model-free and easily implemented distribution with a parametric prior, and adaptively adjusts the dropout rate.
We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets.
arXiv Detail & Related papers (2020-10-11T13:19:58Z)
- Scheduled DropHead: A Regularization Method for Transformer Models [111.18614166615968]
DropHead is a structured dropout method specifically designed for regularizing the multi-head attention mechanism.
It drops entire attention-heads during training.
It prevents the multi-head attention model from being dominated by a small portion of attention heads.
arXiv Detail & Related papers (2020-04-28T07:33:14Z)
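For contrast with DropKey's per-Key masking, the following minimal sketch illustrates the head-level dropout that DropHead (above) describes, assuming per-head attention outputs of shape (batch, heads, tokens, head_dim). The function name and the inverted-dropout rescaling are illustrative assumptions rather than the paper's exact procedure, and the scheduled drop rate from the title is omitted for brevity.

```python
import torch


def drop_head(attn_out: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Zero out entire attention heads with probability p (illustrative sketch).

    attn_out: per-head outputs of shape (batch, heads, tokens, head_dim).
    Surviving heads are rescaled by 1 / (1 - p), as in standard inverted dropout.
    """
    if not training or p == 0.0:
        return attn_out
    batch, heads = attn_out.shape[:2]
    # One Bernoulli keep/drop decision per head per sample, broadcast over tokens.
    keep = torch.bernoulli(
        torch.full((batch, heads, 1, 1), 1.0 - p, device=attn_out.device)
    )
    return attn_out * keep / (1.0 - p)


# Example: with p=0.25, on average one of four heads is dropped per sample.
out = drop_head(torch.randn(2, 4, 16, 32), p=0.25)
```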