Out of Distribution Performance of State of Art Vision Model
- URL: http://arxiv.org/abs/2301.10750v3
- Date: Sun, 8 Oct 2023 22:01:52 GMT
- Title: Out of Distribution Performance of State of Art Vision Model
- Authors: Salman Rahman and Wonkwon Lee
- Abstract summary: Recent work claims that ViT's self-attention mechanism makes it more robust than CNNs.
We investigate the performance of 58 state-of-the-art computer vision models in a unified training setup.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vision transformer (ViT) has advanced to the cutting edge of visual
recognition. According to recent research, transformers are more robust than
CNNs, and ViT's self-attention mechanism is credited for this robustness. We
find, however, that these conclusions rest on unfair experimental conditions
and on comparisons of only a few models, which cannot depict the full picture
of robustness performance. In this study, we evaluate 58 state-of-the-art
computer vision models in a unified training setup, covering not only
attention-based and convolution-based networks but also architectures that
combine convolution and attention, sequence-based models, complementary
search, and network-based methods. Our results demonstrate that robustness
depends on the training setup and model type, and that performance varies with
the type of out-of-distribution shift. This work should help the community
better understand and benchmark the robustness of computer vision models.
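To make the setup concrete, here is a minimal sketch of a unified out-of-distribution evaluation loop; it is not the authors' code. It assumes the `timm` model zoo, and the short `MODEL_NAMES` list, the `clean_loader`, and the `ood_loaders` dictionary (one DataLoader per OOD type) stand in for the paper's 58 models and benchmark data.

```python
import torch
import timm

# Hypothetical subset of architectures spanning convolution-, attention-,
# and hybrid-based families; the actual study covers 58 models.
MODEL_NAMES = ["resnet50", "vit_base_patch16_224", "convnext_tiny"]

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda"):
    """Plain top-1 accuracy of `model` over `loader`."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

def benchmark_ood(clean_loader, ood_loaders):
    """Compare clean vs. out-of-distribution accuracy for each model.

    `ood_loaders` maps an OOD type (e.g. a corruption name) to a DataLoader;
    building those loaders is outside the scope of this sketch.
    """
    results = {}
    for name in MODEL_NAMES:
        model = timm.create_model(name, pretrained=True)
        clean = top1_accuracy(model, clean_loader)
        ood = {k: top1_accuracy(model, dl) for k, dl in ood_loaders.items()}
        results[name] = {"clean": clean, **ood}
    return results
```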
Related papers
- Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition [1.8722948221596285]
The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models.
The proposed method employs a Transformer vision network as the student model, while a convolutional network serves as the teacher model.
The Vision Transformer (ViT) architecture is introduced as a robust framework for capturing global dependencies in images.
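The summary does not spell out the distillation objective; a common Hinton-style loss, shown below as a hedged sketch (the temperature and weighting are illustrative, not values from the paper), blends a temperature-scaled KL term against the teacher's soft targets with the usual hard-label cross-entropy.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Hinton-style distillation: soft-target KL blended with hard-label CE.

    `temperature` and `alpha` are illustrative defaults, not values taken
    from the paper.
    """
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The KL term is scaled by T^2 so its gradients keep a comparable magnitude.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```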
arXiv Detail & Related papers (2023-11-02T14:57:58Z)
- Exploring Model Transferability through the Lens of Potential Energy [78.60851825944212]
Transfer learning has become crucial in computer vision tasks due to the vast availability of pre-trained deep learning models.
Existing methods for measuring the transferability of pre-trained models rely on statistical correlations between encoded static features and task labels.
We present an insightful physics-inspired approach named PED to address these challenges.
arXiv Detail & Related papers (2023-08-29T07:15:57Z)
- Interpretable Computer Vision Models through Adversarial Training: Unveiling the Robustness-Interpretability Connection [0.0]
Interpretability is as essential as robustness when we deploy models in the real world.
Standard models are more susceptible to adversarial attacks than robust ones, and their learned representations are less meaningful to humans.
arXiv Detail & Related papers (2023-07-04T13:51:55Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- Large-scale Robustness Analysis of Video Action Recognition Models [10.017292176162302]
We study robustness of six state-of-the-art action recognition models against 90 different perturbations.
The study reveals some interesting findings: 1) transformer-based models are consistently more robust than CNN-based models, 2) pretraining improves robustness more for transformer-based models than for CNN-based models, and 3) all of the studied models are robust to temporal perturbations on all datasets except SSv2.
arXiv Detail & Related papers (2022-07-04T13:29:34Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- A Comprehensive Study of Vision Transformers on Dense Prediction Tasks [10.013443811899466]
Convolutional Neural Networks (CNNs) have been the standard choice in vision tasks.
Recent studies have shown that Vision Transformers (VTs) achieve comparable performance in challenging tasks such as object detection and semantic segmentation.
This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks.
arXiv Detail & Related papers (2022-01-21T13:18:16Z)
- Inducing Causal Structure for Interpretable Neural Networks [23.68246698789134]
We present a new method, interchange intervention training (IIT).
In IIT, we (1) align variables in the causal model with representations in the neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input.
IIT is fully differentiable, flexibly combines with other objectives, and guarantees that the target causal model is a causal abstraction of the neural model.
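A minimal sketch of one interchange-intervention step, assuming a toy two-layer network and an external oracle that supplies the causal model's counterfactual label (an illustration, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy two-layer network standing in for the neural model (hypothetical).
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))

def iit_loss(base_x, source_x, counterfactual_label, neurons=slice(0, 4)):
    """One interchange-intervention training step (illustrative only).

    The `neurons` slice of the hidden layer is treated as the neural
    realisation of one causal variable; `counterfactual_label` is the output
    the high-level causal model would give under the same intervention,
    supplied by an assumed oracle.
    """
    hidden_base = net[1](net[0](base_x))          # hidden state on the base input
    with torch.no_grad():
        hidden_source = net[1](net[0](source_x))  # hidden state on the source input
    # Interchange intervention: replace the aligned neurons with their
    # source-input values, leaving the rest of the representation intact.
    mask = torch.zeros(hidden_base.shape[1], dtype=torch.bool)
    mask[neurons] = True
    intervened = torch.where(mask, hidden_source, hidden_base)
    logits = net[2](intervened)
    # Train the neural model to match the causal model's counterfactual output.
    return F.cross_entropy(logits, counterfactual_label)
```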
arXiv Detail & Related papers (2021-12-01T21:07:01Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and offering high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)