Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
- URL: http://arxiv.org/abs/2203.05962v1
- Date: Wed, 9 Mar 2022 23:55:24 GMT
- Title: Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
- Authors: Peihao Wang, Wenqing Zheng, Tianlong Chen, Zhangyang Wang
- Abstract summary: Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
However, ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate this undesirable low-pass limitation.
- Score: 111.47461527901318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has recently demonstrated promise in computer vision problems. However, unlike Convolutional Neural Networks (CNNs), the performance of ViT is known to saturate quickly as depth increases, due to the observed attention collapse or patch uniformity. Despite a couple of empirical solutions, a rigorous framework for studying this scalability issue remains elusive. In this paper, we first establish a rigorous theoretical framework to analyze ViT features in the Fourier spectrum domain. We show that the self-attention mechanism inherently amounts to a low-pass filter, which indicates that when ViT scales up its depth, excessive low-pass filtering causes feature maps to preserve only their Direct-Current (DC) component. We then propose two straightforward yet effective techniques to mitigate this undesirable low-pass limitation. The first technique, termed AttnScale, decomposes a self-attention block into low-pass and high-pass components, then rescales and combines these two filters to produce an all-pass self-attention matrix. The second technique, termed FeatScale, re-weights feature maps on separate frequency bands to amplify the high-frequency signals. Both techniques are efficient and hyperparameter-free, and they effectively overcome ViT training artifacts such as attention collapse and patch uniformity. By seamlessly plugging our techniques into multiple ViT variants, we demonstrate that they consistently help ViTs benefit from deeper architectures, bringing up to 1.1% performance gains "for free" (e.g., with little parameter overhead). We publicly release our code and pre-trained models at https://github.com/VITA-Group/ViT-Anti-Oversmoothing.
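The low-pass claim in the abstract can be illustrated numerically with the minimal sketch below (our own illustration, not the paper's code): repeatedly applying a fixed row-stochastic softmax attention matrix to random token features drives the token-varying (non-DC) part of the features toward zero, which is exactly the oversmoothing behavior described above.

```python
# Illustration of the low-pass-filter claim: stacking the same softmax
# attention map many times leaves essentially only the DC component.
import torch

torch.manual_seed(0)
num_tokens, dim = 16, 8
attn = torch.softmax(torch.randn(num_tokens, num_tokens), dim=-1)  # one attention map
x = torch.randn(num_tokens, dim)                                   # token features

for _ in range(32):                             # "32 layers" of the same attention
    x = attn @ x

residual = x - x.mean(dim=0, keepdim=True)      # high-frequency (non-DC) part
print(f"non-DC feature energy after 32 layers: {residual.norm():.2e}")
```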
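The AttnScale idea described in the abstract can be sketched as follows. This is a minimal PyTorch rendition under our own assumptions (the low-pass component is taken to be the uniform token-averaging matrix, and the high-pass residual is re-weighted by a learned per-head scalar); the released repository above is the reference implementation.

```python
# Hedged sketch of AttnScale: split a post-softmax attention matrix into its
# low-pass (uniform averaging) part and high-pass residual, then re-weight
# the residual with a learned per-head scalar (assumption: scalar per head).
import torch
import torch.nn as nn

class AttnScale(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        # learned rescaling weight per head, initialized to zero so the block
        # starts out as the unmodified softmax attention matrix
        self.omega = nn.Parameter(torch.zeros(num_heads))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, heads, tokens, tokens), rows sum to 1 (post-softmax)
        n = attn.size(-1)
        low_pass = torch.full_like(attn, 1.0 / n)     # DC / averaging filter
        high_pass = attn - low_pass                   # zero-row-sum residual
        scale = (1.0 + self.omega).view(1, -1, 1, 1)  # per-head re-weighting
        return low_pass + scale * high_pass
```

Because the high-pass residual has zero row sums, the rescaled matrix still has rows summing to one, so it can be applied to the value matrix like an ordinary attention map.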
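Similarly, here is a minimal sketch of the FeatScale idea, assuming the two frequency bands are the token-wise mean (DC) and the remaining residual, each re-weighted by a learned per-channel vector; again, consult the released code for the actual implementation.

```python
# Hedged sketch of FeatScale: re-weight the DC and high-frequency bands of
# token features with learned per-channel weights (band split is assumed).
import torch
import torch.nn as nn

class FeatScale(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # per-channel weights for the two bands, initialized so the module is
        # an identity mapping at the start of training
        self.w_dc = nn.Parameter(torch.zeros(dim))
        self.w_hf = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) output of a self-attention block
        dc = x.mean(dim=1, keepdim=True)   # DC component shared by all tokens
        hf = x - dc                        # high-frequency (token-varying) part
        return (1.0 + self.w_dc) * dc + (1.0 + self.w_hf) * hf
```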
Related papers
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- Training-Free Acceleration of ViTs with Delayed Spatial Merging [4.523939613157408]
Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning.
We improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations.
We build a unified inference framework called DSM: Delayed Spatial Merging.
arXiv Detail & Related papers (2023-03-04T05:34:25Z)
- Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning [138.29273453811945]
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks.
We propose a new Wavelet Vision Transformer (Wave-ViT) that formulates the invertible down-sampling with wavelet transforms and self-attention learning.
arXiv Detail & Related papers (2022-07-11T16:03:51Z)
- ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator [21.351034332423374]
We propose a novel ViT based fine-grained object discriminator for Fine-Grained Visual Classification (FGVC) tasks.
Besides a ViT backbone, it introduces three novel components, i.e., Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI).
We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-03-24T02:34:57Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework, targeting jointly learning model weights, layer-wise pruning ratios/masks, and skip configurations.
Experiments are conducted with several ViT variants, e.g., DeiT and T2T-ViT backbones, on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criterion that is comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
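For comparison with AttnScale, the Re-attention mechanism summarized in the DeepViT entry above can be sketched roughly as follows. This is our approximation, not the authors' code: a learnable head-mixing matrix applied to the post-softmax attention maps, with a simple row re-normalization standing in for the paper's normalization step.

```python
# Hedged sketch of the Re-attention idea from DeepViT: mix per-head attention
# maps with a learnable head-to-head matrix to increase their diversity.
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        # learnable head-mixing matrix, initialized to the identity so the
        # module starts out as standard multi-head attention
        self.theta = nn.Parameter(torch.eye(num_heads))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, heads, tokens, tokens) post-softmax attention maps
        mixed = torch.einsum("hg,bgij->bhij", self.theta, attn)
        # re-normalize rows so each re-generated map is again a distribution
        # (assumption: stands in for the normalization used in the paper)
        return mixed / mixed.sum(dim=-1, keepdim=True).clamp_min(1e-6)
```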