Out of Distribution Performance of State of Art Vision Model
- URL: http://arxiv.org/abs/2301.10750v3
- Date: Sun, 8 Oct 2023 22:01:52 GMT
- Title: Out of Distribution Performance of State of Art Vision Model
- Authors: Salman Rahman and Wonkwon Lee
- Abstract summary: The claim is that ViT's self-attention mechanism makes it more robust than CNNs.
We investigate the performance of 58 state-of-the-art computer vision models in a unified training setup.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vision transformer (ViT) has advanced to the cutting edge of visual
recognition. According to recent research, transformers are more robust than
CNNs, the claim being that ViT's self-attention mechanism is what makes them so.
We find, however, that these conclusions rest on unfair experimental conditions
and comparisons of only a few models, which cannot capture the full picture of
robustness performance. In this study, we investigate the performance of 58
state-of-the-art computer vision models in a unified training setup, covering
not only attention- and convolution-based networks but also networks that
combine convolution and attention, sequence-based models, complementary search,
and network-based methods. Our results demonstrate that robustness depends on
the training setup and model type, and that performance varies with the type of
out-of-distribution shift. Our research will help the community better
understand and benchmark the robustness of computer vision models.
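As a rough illustration of the kind of unified clean-vs-OOD comparison described above, the sketch below loads pretrained backbones from the timm model zoo and reports the accuracy gap between a clean split and an out-of-distribution split; the model names, directory layout, and batch size are illustrative assumptions, not the paper's exact protocol.

```python
# Rough sketch of a unified clean-vs-OOD accuracy comparison. Assumes timm is
# installed and ImageFolder-style clean / OOD splits exist locally; the model
# names and paths below are placeholders, not the paper's configuration.
import torch
import timm
from timm.data import resolve_data_config, create_transform
from torch.utils.data import DataLoader
from torchvision import datasets

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

@torch.no_grad()
def accuracy(model, loader):
    model.eval().to(DEVICE)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(DEVICE)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

def evaluate_ood_gap(model_name, clean_dir, ood_dir, batch_size=64):
    model = timm.create_model(model_name, pretrained=True)
    transform = create_transform(**resolve_data_config({}, model=model))
    clean = DataLoader(datasets.ImageFolder(clean_dir, transform), batch_size)
    ood = DataLoader(datasets.ImageFolder(ood_dir, transform), batch_size)
    return accuracy(model, clean), accuracy(model, ood)

# Compare an attention-based and a convolution-based backbone under one setup.
for name in ["vit_base_patch16_224", "resnet50"]:
    clean_acc, ood_acc = evaluate_ood_gap(name, "data/clean", "data/ood")
    print(f"{name}: clean={clean_acc:.3f} ood={ood_acc:.3f} gap={clean_acc - ood_acc:.3f}")
```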
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Improving Network Interpretability via Explanation Consistency Evaluation [56.14036428778861]
We propose a framework that acquires more explainable activation heatmaps and simultaneously increases model performance.
Specifically, our framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning.
Our framework then promotes model learning by paying closer attention to training samples with a large difference in explanations.
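The exact consistency metric is not spelled out in this summary; as a hedged stand-in, the sketch below scores each sample by how much its input-gradient saliency changes under augmentation and upweights the inconsistent ones during training.

```python
# Hedged stand-in for the consistency metric: score each sample by how much its
# input-gradient saliency changes under augmentation, and upweight inconsistent
# samples. The saliency choice and weighting scheme are assumptions.
import torch
import torch.nn.functional as F

def saliency(model, x, y):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad.abs().flatten(1)                    # one saliency vector per sample

def consistency_weights(model, x, x_aug, y):
    sim = F.cosine_similarity(saliency(model, x, y), saliency(model, x_aug, y), dim=1)
    return (1.0 - sim).detach()                     # 0 = consistent, larger = inconsistent

def weighted_step(model, optimizer, x, x_aug, y):
    weights = consistency_weights(model, x, x_aug, y)
    per_sample = F.cross_entropy(model(x), y, reduction="none")
    loss = (weights * per_sample).mean()            # focus learning on inconsistent samples
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```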
arXiv Detail & Related papers (2024-08-08T17:20:08Z)
- Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition [49.14350399025926]
We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, to periocular recognition.
Middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images.
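A minimal sketch of the off-the-shelf idea: pull an intermediate feature map from a pretrained backbone, pool it into a descriptor, and match images by cosine similarity. The model name, layer index, and pooling are assumptions, not the paper's configuration.

```python
# Off-the-shelf middle-layer features for matching: pool an intermediate feature
# map from a pretrained backbone and compare descriptors by cosine similarity.
import torch
import torch.nn.functional as F
import timm

def extract_descriptors(model_name, images, layer_index=2):
    model = timm.create_model(model_name, pretrained=True, features_only=True)
    model.eval()
    with torch.no_grad():
        feats = model(images)[layer_index]               # (B, C, H, W) intermediate map
    return F.normalize(feats.mean(dim=(2, 3)), dim=1)    # pooled, L2-normalized descriptor

def match_scores(query_desc, gallery_desc):
    # cosine similarity matrix between query and gallery descriptors
    return query_desc @ gallery_desc.T
```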
arXiv Detail & Related papers (2024-07-28T11:52:36Z)
- Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition [1.8722948221596285]
The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models.
The proposed method employs a Transformer vision network as the student model, while a convolutional network serves as the teacher model.
The Vision Transformer (ViT) architecture is introduced as a robust framework for capturing global dependencies in images.
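A compact distillation sketch consistent with this summary, with a frozen convolutional teacher guiding a ViT student via temperature-softened logits; the temperature and loss weighting are assumed values, not those reported in the paper.

```python
# Knowledge-distillation sketch: frozen convolutional teacher, ViT student.
# Temperature T and weight alpha are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                     # KL to the teacher's soft targets
    hard = F.cross_entropy(student_logits, labels)  # ordinary hard-label loss
    return alpha * soft + (1 - alpha) * hard

def distill_step(student, teacher, optimizer, x, y):
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)
    loss = distillation_loss(student(x), teacher_logits, y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```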
arXiv Detail & Related papers (2023-11-02T14:57:58Z)
- Exploring Model Transferability through the Lens of Potential Energy [78.60851825944212]
Transfer learning has become crucial in computer vision tasks due to the vast availability of pre-trained deep learning models.
Existing methods for measuring the transferability of pre-trained models rely on statistical correlations between encoded static features and task labels.
We present an insightful physics-inspired approach named PED to address these challenges.
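To make the baseline idea concrete, the sketch below scores transferability by fitting a quick linear probe on frozen pre-trained features. This illustrates the feature-label correlation approach the summary critiques, not the proposed PED method.

```python
# Feature-label-correlation baseline: a quick linear probe on frozen features
# as a transferability proxy. Illustrative only; NOT the paper's PED method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_transferability(features: np.ndarray, labels: np.ndarray) -> float:
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)   # higher score -> features likely transfer better
```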
arXiv Detail & Related papers (2023-08-29T07:15:57Z)
- Interpretable Computer Vision Models through Adversarial Training: Unveiling the Robustness-Interpretability Connection [0.0]
Interpretability is as essential as robustness when we deploy models in the real world.
Standard models are more susceptible to adversarial attacks than robust models, and their learned representations are less meaningful to humans.
arXiv Detail & Related papers (2023-07-04T13:51:55Z)
- Large-scale Robustness Analysis of Video Action Recognition Models [10.017292176162302]
We study the robustness of six state-of-the-art action recognition models against 90 different perturbations.
The study reveals some interesting findings: 1) transformer-based models are consistently more robust than CNN-based models, 2) pretraining improves robustness more for transformer-based models than for CNN-based models, and 3) all of the studied models are robust to temporal perturbations on every dataset except SSv2.
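A tiny sketch of the kind of robustness score such studies typically report: the fraction of clean accuracy retained under a perturbation. The formula here is a common convention, not necessarily the cited paper's exact metric.

```python
# Fraction of clean accuracy retained under a perturbation (1.0 = fully robust).
def relative_robustness(clean_acc: float, perturbed_acc: float) -> float:
    return 1.0 - (clean_acc - perturbed_acc) / clean_acc

# e.g. a model at 80% clean accuracy that drops to 60% retains 0.75 of its performance
print(relative_robustness(0.80, 0.60))  # 0.75
```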
arXiv Detail & Related papers (2022-07-04T13:29:34Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
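A generic contrastive (InfoNCE/NT-Xent-style) loss sketch to illustrate the self-supervised setup; the cited paper's actual view construction, encoder, and objective details are not given here and are assumed.

```python
# Generic InfoNCE-style contrastive loss for paired views of the same clips.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # z1, z2: (B, D) embeddings of two views of the same B clips
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature            # (B, B) pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)     # matching pairs lie on the diagonal
```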
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- A Comprehensive Study of Vision Transformers on Dense Prediction Tasks [10.013443811899466]
Convolutional Neural Networks (CNNs) have been the standard choice in vision tasks.
Recent studies have shown that Vision Transformers (VTs) achieve comparable performance in challenging tasks such as object detection and semantic segmentation.
This poses several questions about their generalizability, robustness, reliability, and texture bias when used to extract features for complex tasks.
arXiv Detail & Related papers (2022-01-21T13:18:16Z)
- Inducing Causal Structure for Interpretable Neural Networks [23.68246698789134]
We present a new method, interchange intervention training (IIT).
In IIT, we (1) align variables in the causal model with representations in the neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input.
IIT is fully differentiable, flexibly combines with other objectives, and guarantees that the target causal model is a causal abstraction of the neural model.
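A hedged sketch of the interchange-intervention idea: cache an aligned hidden activation from a source input, swap it into the forward pass of a base input, and train the resulting output toward the causal model's counterfactual label. The nn.Sequential model, hook-based swap, and single aligned layer are simplifying assumptions.

```python
# Interchange intervention on a single aligned layer of an nn.Sequential model.
# Assumes base and source batches have the same size; gradients flow only
# through the layers after the intervention point in this simplified version.
import torch
import torch.nn as nn
import torch.nn.functional as F

def interchange_forward(model: nn.Sequential, layer_idx: int, base, source):
    cached = {}
    # 1) run the source input and cache the activation at the aligned layer
    handle = model[layer_idx].register_forward_hook(
        lambda mod, inp, out: cached.update(act=out.detach())
    )
    with torch.no_grad():
        model(source)
    handle.remove()
    # 2) re-run the base input, swapping in the cached source activation
    handle = model[layer_idx].register_forward_hook(lambda mod, inp, out: cached["act"])
    output = model(base)
    handle.remove()
    return output

def iit_step(model, optimizer, layer_idx, base, source, counterfactual_label):
    # push the network's counterfactual prediction toward the causal model's answer
    logits = interchange_forward(model, layer_idx, base, source)
    loss = F.cross_entropy(logits, counterfactual_label)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```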
arXiv Detail & Related papers (2021-12-01T21:07:01Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)