Large-scale Robustness Analysis of Video Action Recognition Models
- URL: http://arxiv.org/abs/2207.01398v2
- Date: Fri, 7 Apr 2023 16:40:59 GMT
- Title: Large-scale Robustness Analysis of Video Action Recognition Models
- Authors: Madeline Chantry Schiappa, Naman Biyani, Prudvi Kamtam, Shruti Vyas,
Hamid Palangi, Vibhav Vineet, Yogesh Rawat
- Abstract summary: We study the robustness of six state-of-the-art action recognition models against 90 different perturbations.
The study reveals several interesting findings: 1) transformer-based models are consistently more robust than CNN-based models, 2) pretraining improves robustness more for transformer-based models than for CNN-based models, and 3) all of the studied models are robust to temporal perturbations on all datasets except SSv2.
- Score: 10.017292176162302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We have seen great progress in video action recognition in recent
years. There are several models based on convolutional neural networks (CNNs)
and some recent transformer-based approaches that provide top performance on
existing benchmarks. In this work, we perform a large-scale robustness analysis
of these existing models for video action recognition. We focus on robustness
against real-world distribution-shift perturbations rather than adversarial
perturbations. We propose four different benchmark datasets, HMDB51-P,
UCF101-P, Kinetics400-P, and SSv2-P, to perform this analysis. We study the
robustness of six state-of-the-art action recognition models against 90
different perturbations. The study reveals several interesting findings: 1)
transformer-based models are consistently more robust than CNN-based models,
2) pretraining improves robustness more for transformer-based models than for
CNN-based models, and 3) all of the studied models are robust to temporal
perturbations on all datasets except SSv2, suggesting that the importance of
temporal information for action recognition varies across datasets and activities.
Next, we study the role of augmentations in model robustness and present a
real-world dataset, UCF101-DS, which contains realistic distribution shifts, to
further validate some of these findings. We believe this study will serve as a
benchmark for future research in robust video action recognition.
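The evaluation protocol the abstract describes reduces to scoring a model on clean clips and on clips with a perturbation applied at increasing severity levels, then comparing the two accuracies. The following minimal PyTorch sketch illustrates that loop under stated assumptions: the Gaussian-noise perturbation, the severity-to-noise mapping, and the relative-robustness score are illustrative stand-ins, not the paper's exact perturbation suite or metric.

# Minimal sketch (not the authors' released code) of a distribution-shift
# robustness evaluation: compare clean accuracy against accuracy under a
# perturbation applied at several severity levels.
import torch

def add_gaussian_noise(clips, severity):
    # Hypothetical appearance perturbation; clips are (B, C, T, H, W) in [0, 1].
    sigma = 0.02 * severity  # assumed severity-to-noise mapping
    return (clips + sigma * torch.randn_like(clips)).clamp(0.0, 1.0)

@torch.no_grad()
def accuracy(model, loader, perturb=None, severity=0):
    model.eval()
    correct = total = 0
    for clips, labels in loader:
        if perturb is not None:
            clips = perturb(clips, severity)
        correct += (model(clips).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total

def relative_robustness(model, loader, perturb, severities=(1, 2, 3, 4, 5)):
    # 1.0 means no accuracy drop under the perturbation; lower is less robust.
    clean = accuracy(model, loader)
    mean_drop = sum(clean - accuracy(model, loader, perturb, s)
                    for s in severities) / len(severities)
    return 1.0 - mean_drop / clean

# Usage (hypothetical model and loader):
# gamma = relative_robustness(video_model, val_loader, add_gaussian_noise)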
Related papers
- Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts [13.752169303624147]
Action recognition models often lack robustness when faced with natural distribution shifts between training and test data.
We propose two novel evaluation methods to assess model resilience to such distribution disparity.
We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models.
arXiv Detail & Related papers (2024-01-21T05:50:39Z)
- Robustness Analysis on Foundational Segmentation Models [28.01242494123917]
In this work, we perform a robustness analysis of Visual Foundation Models (VFMs) for segmentation tasks.
We benchmark seven state-of-the-art segmentation architectures on two different datasets.
Our findings reveal several key insights: VFMs exhibit vulnerabilities to compression-induced corruptions; multimodal models, despite not outpacing all unimodal models in robustness, show competitive resilience in zero-shot scenarios; and VFMs demonstrate enhanced robustness for certain object categories.
arXiv Detail & Related papers (2023-06-15T16:59:42Z)
- Reliability in Semantic Segmentation: Are We on the Right Track? [15.0189654919665]
We analyze a broad variety of models, spanning from older ResNet-based architectures to novel transformers.
We find that while recent models are significantly more robust, they are not overall more reliable in terms of uncertainty estimation.
This is the first study on modern segmentation models focused on both robustness and uncertainty estimation.
arXiv Detail & Related papers (2023-03-20T17:38:24Z)
- Out of Distribution Performance of State of Art Vision Model [0.0]
It has been claimed that ViT's self-attention mechanism makes it more robust than CNNs.
We investigate the performance of 58 state-of-the-art computer vision models in a unified training setup.
arXiv Detail & Related papers (2023-01-25T18:14:49Z)
- An Empirical Study of Deep Learning Models for Vulnerability Detection [4.243592852049963]
We surveyed and reproduced 9 state-of-the-art deep learning models on 2 widely used vulnerability detection datasets.
We investigated model capabilities, training data, and model interpretation.
Our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models.
arXiv Detail & Related papers (2022-12-15T19:49:34Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks [82.21746840893658]
This paper investigates the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
We show that while the ResNet-18 model trained on DWT spectrograms achieves a high recognition accuracy, attacking this model is relatively more costly for the adversary.
arXiv Detail & Related papers (2022-04-14T15:14:08Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, and attains high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to train on comparatively small datasets. (A minimal tubelet-tokenization sketch appears after this list.)
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents.
One way for achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis.
We compare a traditional monolithic end-to-end deep learning model and a previously proposed model based on an ensemble of simpler neural networks that detects firearms via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z)
- From Sound Representation to Model Robustness [82.21746840893658]
We investigate the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
Averaged over various experiments on three environmental sound datasets, we find that the ResNet-18 model outperforms other deep learning architectures.
arXiv Detail & Related papers (2020-07-27T17:30:49Z)
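The ViViT entry above says the model extracts spatio-temporal tokens from the input video before encoding them with transformer layers. The sketch below shows one common way to implement that tokenization, a tubelet embedding realised as a strided 3D convolution; the tubelet size and embedding width are illustrative assumptions, not the only configuration the paper reports.

# Minimal sketch of ViViT-style spatio-temporal tokenization: a strided 3D
# convolution maps non-overlapping video "tubelets" to one embedding each,
# producing a token sequence a standard transformer encoder can consume.
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    def __init__(self, in_channels=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        # kernel_size == stride => non-overlapping tubelets, one token each
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        # video: (B, C, T, H, W) -> tokens: (B, N, embed_dim)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # N = T' * H' * W'

# Usage: 16 frames at 224x224 yield (16//2) * (224//16)**2 = 1568 tokens.
# tokens = TubeletEmbedding()(torch.randn(2, 3, 16, 224, 224))  # (2, 1568, 768)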