Sparse Double Descent in Vision Transformers: real or phantom threat?
- URL: http://arxiv.org/abs/2307.14253v1
- Date: Wed, 26 Jul 2023 15:33:35 GMT
- Title: Sparse Double Descent in Vision Transformers: real or phantom threat?
- Authors: Victor Quétu, Marta Milovanovic and Enzo Tartaglione
- Abstract summary: Vision transformers (ViTs) are state-of-the-art thanks to their attention-based approach.
Some studies have reported a "sparse double descent" phenomenon that can occur in modern deep-learning models.
This raises practical questions about the optimal model size and the best trade-off between sparsity and performance.
- Score: 3.9533044769534444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have been of broad interest in recent
theoretical and empirical works. They are state-of-the-art thanks to their
attention-based approach, which improves the identification of key features
and patterns within images by avoiding inductive bias, resulting in highly
accurate image analysis. Meanwhile, recent studies have reported a "sparse
double descent" phenomenon that can occur in modern deep-learning models,
where extremely over-parametrized models can generalize well. This raises
practical questions about the optimal model size and launches the quest for
the best trade-off between sparsity and performance: are Vision Transformers
also prone to sparse double descent? Can we find a way to avoid such a
phenomenon? Our work tackles the occurrence of sparse double descent on ViTs.
While some works have shown that traditional architectures, like ResNet, are
condemned to the sparse double descent phenomenon, for ViTs we observe that an
optimally-tuned $\ell_2$ regularization relieves the phenomenon. However,
everything comes at a cost: the optimal lambda sacrifices the potential
compression of the ViT.
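To make the setup concrete, here is a minimal, hedged sketch of how one might probe sparse double descent on a ViT: sweep global magnitude-pruning sparsity levels while training with an explicit $\ell_2$ penalty. The model choice (torchvision's vit_b_16), the sparsity grid, the lambda value, the train-prune-finetune schedule, and the helper names (train_loader, evaluate) are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch: probing sparse double descent on a ViT by sweeping global
# magnitude-pruning sparsity while training with an explicit L2 penalty.
# Assumptions (not from the paper): torchvision's vit_b_16, one-shot pruning
# to each target sparsity, a small sparsity grid, and lambda = 1e-4.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torchvision.models import vit_b_16


def l2_penalty(model: nn.Module) -> torch.Tensor:
    """Sum of squared weights, added to the loss as lambda * ||w||^2."""
    return sum((p ** 2).sum() for p in model.parameters() if p.requires_grad)


def prune_linear_global(model: nn.Module, amount: float) -> None:
    """One-shot global magnitude pruning over all Linear weights."""
    params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)


def train_one_epoch(model, loader, optimizer, lam):
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels) + lam * l2_penalty(model)
        loss.backward()
        optimizer.step()


lam = 1e-4                                    # assumed L2 strength
sparsity_grid = [0.0, 0.5, 0.8, 0.9, 0.95]    # assumed sparsity levels
for sparsity in sparsity_grid:
    model = vit_b_16(weights=None, num_classes=10)            # fresh model per level
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # train_one_epoch(model, train_loader, optimizer, lam)    # dense training; train_loader is assumed
    if sparsity > 0:
        prune_linear_global(model, amount=sparsity)            # prune to the target sparsity
        # train_one_epoch(model, train_loader, optimizer, lam) # fine-tune at that sparsity
    # acc = evaluate(model, test_loader)                       # evaluate() is assumed
```

prune.global_unstructured applies the sparsity through a mask, so the pruned positions stay at zero during fine-tuning; plotting accuracy against the sparsity grid is what would (or would not) reveal a double-descent curve.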
Related papers
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z) - Improving Interpretation Faithfulness for Vision Transformers [42.86486715574245]
Vision Transformers (ViTs) have achieved state-of-the-art performance for various vision tasks.
ViTs suffer from issues with explanation faithfulness, as their focal points are fragile to adversarial attacks.
We propose a rigorous approach to mitigate these issues by introducing Faithful ViTs (FViTs).
arXiv Detail & Related papers (2023-11-29T18:51:21Z) - Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel multi-axis generalization of the recent Hyena layer.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z) - 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z) - Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z) - The Principle of Diversity: Training Stronger Vision Transformers Calls
for Reducing All Levels of Redundancy [111.49944789602884]
This paper systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space.
We propose corresponding regularizers that encourage representation diversity and coverage at each of those levels, enabling the capture of more discriminative information.
arXiv Detail & Related papers (2022-03-12T04:48:12Z) - Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain
Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to regenerate the attention maps and increase their diversity (a minimal, hedged sketch of this head-mixing idea follows the list).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
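Related to the DeepViT entry above, below is a minimal, hedged sketch of the head-mixing idea behind Re-attention: per-head attention maps are linearly recombined through a learnable head-to-head matrix before being applied to the values. The class name, the identity initialization of the mixing matrix, the tensor sizes in the usage line, and the omission of the paper's normalization step are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged sketch of head-mixing re-attention: per-head attention maps are
# recombined by a learnable H-by-H matrix before being applied to the values.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Learnable head mixer; identity init so it starts as standard attention
        # (an assumption for this sketch).
        self.theta = nn.Parameter(torch.eye(num_heads))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        # Re-attention step: mix attention maps across heads to restore diversity.
        attn = torch.einsum("hg,bgnm->bhnm", self.theta, attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: drop-in replacement for standard multi-head self-attention.
x = torch.randn(2, 197, 384)                        # (batch, tokens, dim) — assumed sizes
print(ReAttention(dim=384, num_heads=8)(x).shape)   # torch.Size([2, 197, 384])
```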