DropKey
- URL: http://arxiv.org/abs/2208.02646v4
- Date: Tue, 11 Apr 2023 07:35:33 GMT
- Title: DropKey
- Authors: Bonan Li and Yinhan Hu and Xuecheng Nie and Congying Han and Xiangjian Jiang and Tiande Guo and Luoqi Liu
- Abstract summary: We focus on analyzing and improving the dropout technique for self-attention layers of Vision Transformer.
We propose to move the dropout operation ahead of the attention matrix calculation and set the Key as the dropout unit.
We experimentally validate that the proposed schedule avoids overfitting to low-level features and the loss of high-level semantics.
- Score: 9.846606347586906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on analyzing and improving the dropout technique for self-attention layers of Vision Transformer, which is important yet surprisingly ignored by prior works. In particular, we investigate three core questions: First, what to drop in self-attention layers? Different from dropping attention weights as in the literature, we propose to move the dropout operation ahead of the attention matrix calculation and set the Key as the dropout unit, yielding a novel dropout-before-softmax scheme. We theoretically verify that this scheme preserves both the regularization effect and the probabilistic nature of the attention weights, alleviating overfitting to specific patterns and encouraging the model to globally capture vital information; Second, how to schedule the drop ratio in consecutive layers? In contrast to using a constant drop ratio for all layers, we present a new decreasing schedule that gradually lowers the drop ratio along the stack of self-attention layers. We experimentally validate that this schedule avoids overfitting to low-level features and the loss of high-level semantics, thus improving the robustness and stability of model training; Third, is a structured dropout operation, as used for CNNs, necessary? We experiment with a patch-based, block-wise version of the dropout operation and find that this useful trick for CNNs is not essential for ViT. Based on the exploration of these three questions, we present the novel DropKey method, which regards the Key as the drop unit and exploits a decreasing schedule for the drop ratio, improving ViTs in a general way. Comprehensive experiments demonstrate the effectiveness of DropKey for various ViT architectures, e.g., T2T and VOLO, as well as for various vision tasks, e.g., image classification, object detection, human-object interaction detection and human body shape recovery.
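The following minimal PyTorch sketch illustrates the two ideas described in the abstract: dropout applied to the attention logits before the softmax (so the surviving weights are renormalized into a valid probability distribution), and a linearly decreasing drop ratio across layers. The class and function names (DropKeyAttention, decreasing_drop_schedule) and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DropKeyAttention(nn.Module):
    """Single-head self-attention with DropKey-style masking (illustrative sketch)."""

    def __init__(self, dim: int, drop_ratio: float = 0.1):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.drop_ratio = drop_ratio  # probability of masking a Key entry

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale  # (batch, tokens, tokens)
        if self.training and self.drop_ratio > 0:
            # Dropout *before* softmax: masked entries receive a large negative
            # value, so the softmax renormalizes over the surviving Keys and the
            # attention weights remain a valid probability distribution.
            drop_mask = torch.bernoulli(torch.full_like(logits, self.drop_ratio))
            logits = logits + drop_mask * -1e9
        attn = logits.softmax(dim=-1)
        return self.proj(attn @ v)


def decreasing_drop_schedule(base_ratio: float, num_layers: int) -> list:
    """Linearly decay the drop ratio from base_ratio (first layer) to 0 (last layer)."""
    if num_layers == 1:
        return [base_ratio]
    return [base_ratio * (1.0 - i / (num_layers - 1)) for i in range(num_layers)]


# Usage: earlier layers are regularized more strongly than later ones.
layers = nn.ModuleList(
    [DropKeyAttention(dim=64, drop_ratio=r)
     for r in decreasing_drop_schedule(base_ratio=0.3, num_layers=4)]
)
x = torch.randn(2, 16, 64)
for layer in layers:
    x = layer(x)
```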
Related papers
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
- DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions [63.61970125369834]
We present DropPos, a novel pretext task designed to reconstruct Dropped Positions.
The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
arXiv Detail & Related papers (2023-09-07T09:12:02Z)
- Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z)
- Dropout Reduces Underfitting [85.61466286688385]
In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training.
We find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient.
Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards.
arXiv Detail & Related papers (2023-03-02T18:59:15Z)
- Implicit and Efficient Point Cloud Completion for 3D Single Object Tracking [9.372859423951349]
We introduce two novel modules, i.e., Adaptive Refine Prediction (ARP) and Target Knowledge Transfer (TKT).
Our model achieves state-of-the-art performance while maintaining a lower computational consumption.
arXiv Detail & Related papers (2022-09-01T15:11:06Z)
- Adaptive Online Incremental Learning for Evolving Data Streams [4.3386084277869505]
The first major difficulty is concept drift, that is, the probability distribution of the streaming data changes as the data arrives.
The second major difficulty is catastrophic forgetting, that is, forgetting previously learned knowledge when learning something new.
Our research builds on this observation and attempts to overcome these difficulties.
arXiv Detail & Related papers (2022-01-05T14:25:53Z)
- Advanced Dropout: A Model-free Methodology for Bayesian Dropout Optimization [62.8384110757689]
Overfitting ubiquitously exists in real-world applications of deep neural networks (DNNs).
The advanced dropout technique applies a model-free and easily implemented distribution with a parametric prior, and adaptively adjusts the dropout rate.
We evaluate the effectiveness of the advanced dropout against nine dropout techniques on seven computer vision datasets.
arXiv Detail & Related papers (2020-10-11T13:19:58Z)
- Scheduled DropHead: A Regularization Method for Transformer Models [111.18614166615968]
DropHead is a structured dropout method specifically designed for regularizing the multi-head attention mechanism.
It drops entire attention-heads during training.
It prevents the multi-head attention model from being dominated by a small portion of attention heads.
arXiv Detail & Related papers (2020-04-28T07:33:14Z)
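For contrast with DropKey's per-Key masking, the following minimal sketch illustrates the head-level dropout that DropHead (above) describes, assuming per-head attention outputs of shape (batch, heads, tokens, head_dim). The function name and the inverted-dropout rescaling are illustrative assumptions rather than the paper's exact procedure, and the scheduled drop rate from the title is omitted for brevity.

```python
import torch


def drop_head(attn_out: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Zero out entire attention heads with probability p (illustrative sketch).

    attn_out: per-head outputs of shape (batch, heads, tokens, head_dim).
    Surviving heads are rescaled by 1 / (1 - p), as in standard inverted dropout.
    """
    if not training or p == 0.0:
        return attn_out
    batch, heads = attn_out.shape[:2]
    # One Bernoulli keep/drop decision per head per sample, broadcast over tokens.
    keep = torch.bernoulli(
        torch.full((batch, heads, 1, 1), 1.0 - p, device=attn_out.device)
    )
    return attn_out * keep / (1.0 - p)


# Example: with p=0.25, on average one of four heads is dropped per sample.
out = drop_head(torch.randn(2, 4, 16, 32), p=0.25)
```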