Are Vision Transformers Robust to Spurious Correlations?
- URL: http://arxiv.org/abs/2203.09125v1
- Date: Thu, 17 Mar 2022 07:03:37 GMT
- Title: Are Vision Transformers Robust to Spurious Correlations?
- Authors: Soumya Suvra Ghosal, Yifei Ming and Yixuan Li
- Abstract summary: Deep neural networks may be susceptible to learning spurious correlations that hold on average but not in atypical test samples.
We investigate the robustness of vision transformers to spurious correlations on three benchmark datasets.
Key to their success is the ability to generalize better from the examples where spurious correlations do not hold.
- Score: 23.73056953692978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks may be susceptible to learning spurious correlations
that hold on average but not in atypical test samples. With the recent
emergence of vision transformer (ViT) models, it remains underexplored how
spurious correlations are manifested in such architectures. In this paper, we
systematically investigate the robustness of vision transformers to spurious
correlations on three challenging benchmark datasets and compare their
performance with popular CNNs. Our study reveals that when pre-trained on a
sufficiently large dataset, ViT models are more robust to spurious correlations
than CNNs. Key to their success is the ability to generalize better from the
examples where spurious correlations do not hold. Further, we perform extensive
ablations and experiments to understand the role of the self-attention
mechanism in providing robustness under spuriously correlated environments. We
hope that our work will inspire future research on further understanding the
robustness of ViT models.
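The benchmark datasets are not named in this summary, but robustness to spurious correlations is typically scored by worst-group accuracy, where a group is a (label, spurious attribute) pair and the worst group is the one where the correlation breaks. Below is a minimal sketch of that evaluation, not the authors' code: it assumes a Waterbirds-style loader yielding (image, label, spurious attribute) triples, and the timm model names in the usage comment are illustrative placeholders.

```python
# Minimal sketch: worst-group accuracy for spurious-correlation robustness.
# Assumes each batch is (images, labels, spurious_attrs); a group is the
# pair (true label, spurious attribute), e.g. (bird type, background).
import torch
from collections import defaultdict

@torch.no_grad()
def worst_group_accuracy(model, loader, device="cpu"):
    """Return per-group accuracies and the worst-group accuracy."""
    model.eval().to(device)
    correct, total = defaultdict(int), defaultdict(int)
    for images, labels, attrs in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for y, a, p in zip(labels, attrs, preds):
            g = (int(y), int(a))
            correct[g] += int(p == y)
            total[g] += 1
    per_group = {g: correct[g] / total[g] for g in total}
    return per_group, min(per_group.values())

# Usage sketch (model names are placeholders, any classifier works):
# import timm
# vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
# cnn = timm.create_model("resnet50", pretrained=True, num_classes=2)
# _, vit_wga = worst_group_accuracy(vit, test_loader)
# _, cnn_wga = worst_group_accuracy(cnn, test_loader)
```

Comparing the worst-group accuracy of a pre-trained ViT against a CNN baseline of similar capacity is one way to reproduce the kind of comparison the abstract describes.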
Related papers
- Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility [46.171357375793235]
We identify high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. Large learning rates produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Our investigation of the mechanisms underlying this phenomenon reveals the importance of confident mispredictions of bias-conflicting samples under large learning rates.
arXiv Detail & Related papers (2025-07-23T17:59:02Z) - Comparative Analysis of Deep Learning Strategies for Hypertensive Retinopathy Detection from Fundus Images: From Scratch and Pre-trained Models [5.860609259063137]
This paper presents a comparative analysis of deep learning strategies for detecting hypertensive retinopathy from fundus images. We investigate three distinct approaches: a custom CNN, a suite of pre-trained transformer-based models, and an AutoML solution.
arXiv Detail & Related papers (2025-06-14T13:11:33Z) - Autoencoder based approach for the mitigation of spurious correlations [2.7624021966289605]
Spurious correlations refer to erroneous associations in data that do not reflect true underlying relationships.
These correlations can lead deep neural networks (DNNs) to learn patterns that are not robust across diverse datasets or real-world scenarios.
We propose an autoencoder-based approach to analyze the nature of spurious correlations that exist in the Global Wheat Head Detection (GWHD) 2021 dataset.
arXiv Detail & Related papers (2024-06-27T05:28:44Z) - RAT: Retrieval-Augmented Transformer for Click-Through Rate Prediction [68.34355552090103]
This paper develops a Retrieval-Augmented Transformer (RAT), aiming to acquire fine-grained feature interactions within and across samples.
We then build Transformer layers with cascaded attention to capture both intra- and cross-sample feature interactions.
Experiments on real-world datasets substantiate the effectiveness of RAT and suggest its advantage in long-tail scenarios.
arXiv Detail & Related papers (2024-04-02T19:14:23Z) - Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z) - Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z) - Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling [25.76814731638375]
There are two de facto standard architectures in computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
We show that these approaches overlook that the optimal inductive bias also changes as the target data scale changes.
The more convolution-like inductive bias a model includes, the smaller the data scale needed for the ViT-like model to outperform ResNet.
arXiv Detail & Related papers (2022-10-04T04:20:20Z) - Explicit Tradeoffs between Adversarial and Natural Distributional Robustness [48.44639585732391]
In practice, models need to enjoy both types of robustness to ensure reliability.
In this work, we show that in fact, explicit tradeoffs exist between adversarial and natural distributional robustness.
arXiv Detail & Related papers (2022-09-15T19:58:01Z) - Large-scale Robustness Analysis of Video Action Recognition Models [10.017292176162302]
We study the robustness of six state-of-the-art action recognition models against 90 different perturbations.
The study reveals some interesting findings: 1) transformer-based models are consistently more robust than CNN-based models, 2) pretraining improves robustness more for transformer-based models than for CNN-based models, and 3) all of the studied models are robust to temporal perturbations on all datasets except SSv2.
arXiv Detail & Related papers (2022-07-04T13:29:34Z) - SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested under various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.