Recur, Attend or Convolve? Frame Dependency Modeling Matters for
Cross-Domain Robustness in Action Recognition
- URL: http://arxiv.org/abs/2112.12175v1
- Date: Wed, 22 Dec 2021 19:11:53 GMT
- Title: Recur, Attend or Convolve? Frame Dependency Modeling Matters for
Cross-Domain Robustness in Action Recognition
- Authors: Sofia Broomé, Ernest Pokropek, Boyu Li, Hedvig Kjellström
- Abstract summary: Previous results have shown that 2D Convolutional Neural Networks (CNNs) tend to be biased toward texture rather than shape for various computer vision tasks.
This raises suspicion that large video models learn spurious correlations rather than to track relevant shapes over time.
We study the cross-domain robustness for recurrent, attention-based and convolutional video models, respectively, to investigate whether this robustness is influenced by the frame dependency modeling.
- Score: 0.5448283690603357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most action recognition models today are highly parameterized, and evaluated
on datasets with predominantly spatially distinct classes. Previous results for
single images have shown that 2D Convolutional Neural Networks (CNNs) tend to
be biased toward texture rather than shape for various computer vision tasks
(Geirhos et al., 2019), reducing generalization. Taken together, this raises
suspicion that large video models learn spurious correlations rather than to
track relevant shapes over time and infer generalizable semantics from their
movement. A natural way to avoid parameter explosion when learning visual
patterns over time is to make use of recurrence across the time-axis. In this
article, we empirically study the cross-domain robustness for recurrent,
attention-based and convolutional video models, respectively, to investigate
whether this robustness is influenced by the frame dependency modeling. Our
novel Temporal Shape dataset is proposed as a light-weight dataset to assess
the ability to generalize across temporal shapes which are not revealed from
single frames. We find that when controlling for performance and layer
structure, recurrent models show better out-of-domain generalization ability on
the Temporal Shape dataset than convolution- and attention-based models.
Moreover, our experiments indicate that convolution- and attention-based models
exhibit more texture bias on Diving48 than recurrent models.
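
The contrast among the three ways of modeling frame dependencies can be made concrete in code. The following minimal PyTorch sketch (illustrative only, not the authors' models; tensor sizes and module choices are assumptions) applies recurrence, self-attention, and temporal convolution to the same sequence of per-frame features:

```python
# Minimal sketch (not the authors' code): three ways to model frame
# dependencies over a sequence of per-frame features, assuming frames
# have already been encoded to a (batch, time, dim) tensor.
import torch
import torch.nn as nn

B, T, D = 2, 16, 128  # batch, frames, feature dim (illustrative)
frame_feats = torch.randn(B, T, D)

# Recurrent: state is carried across the time axis, so parameters are
# shared over time and do not grow with clip length.
lstm = nn.LSTM(input_size=D, hidden_size=D, batch_first=True)
rec_out, _ = lstm(frame_feats)           # (B, T, D)

# Attention: every frame attends to every other frame in one step.
attn = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
att_out = attn(frame_feats)              # (B, T, D)

# Convolution: dependencies limited to a fixed temporal receptive field.
conv = nn.Conv1d(in_channels=D, out_channels=D, kernel_size=3, padding=1)
conv_out = conv(frame_feats.transpose(1, 2)).transpose(1, 2)  # (B, T, D)

print(rec_out.shape, att_out.shape, conv_out.shape)
```

Note that only the recurrent module carries an explicit state across the time axis, which is the property the abstract links to avoiding parameter explosion and, empirically, to better out-of-domain generalization.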
Related papers
- OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive
Learning [67.07363529640784]
We propose OpenSTL to categorize prevalent approaches into recurrent-based and recurrent-free models.
We conduct standard evaluations on datasets across various domains, including synthetic moving object trajectories, human motion, driving scenes, traffic flow, and weather forecasting.
We find that recurrent-free models achieve a better balance between efficiency and performance than recurrent models.
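
As a hedged illustration of what "recurrent-free" means here (a toy sketch under assumed sizes, not OpenSTL's code): input frames can be stacked along the channel axis and future frames predicted with plain 2D convolutions, avoiding any recurrent state:

```python
# Hedged sketch of a recurrent-free spatio-temporal predictor: stack
# input frames on the channel axis and predict future frames with
# ordinary 2D convolutions (illustrative sizes only).
import torch
import torch.nn as nn

T_in, T_out, C, H, W = 4, 4, 1, 32, 32    # assumed toy sizes
frames = torch.randn(8, T_in, C, H, W)     # (batch, time, chan, H, W)

model = nn.Sequential(
    nn.Conv2d(T_in * C, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, T_out * C, kernel_size=3, padding=1),
)

x = frames.flatten(1, 2)                   # (batch, T_in*C, H, W)
pred = model(x).unflatten(1, (T_out, C))   # (batch, T_out, C, H, W)
print(pred.shape)
```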
arXiv Detail & Related papers (2023-06-20T03:02:14Z)
- SeqLink: A Robust Neural-ODE Architecture for Modelling Partially Observed Time Series [11.261457967759688]
We introduce SeqLink, an innovative neural architecture designed to enhance the robustness of sequence representation.
We demonstrate that SeqLink improves the modelling of intermittent time series, consistently outperforming state-of-the-art approaches.
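
SeqLink's architecture is not reproduced here, but its Neural-ODE ingredient can be sketched generically: a learned derivative network evolves a latent state across the irregular gaps between observations (the fixed-step Euler integration and all sizes below are assumptions):

```python
# Generic Neural-ODE ingredient (not SeqLink itself): evolve a latent
# state between irregular observation times with fixed-step Euler
# integration of a learned derivative network.
import torch
import torch.nn as nn

class LatentODE(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, h: torch.Tensor, dt: float, steps: int = 10) -> torch.Tensor:
        # Euler steps: h <- h + (dt / steps) * f(h)
        for _ in range(steps):
            h = h + (dt / steps) * self.f(h)
        return h

ode = LatentODE(dim=16)
h = torch.zeros(1, 16)
for t_gap in [0.3, 1.2, 0.1]:  # irregular gaps between observations
    h = ode(h, dt=t_gap)       # state evolves continuously between samples
print(h.shape)
```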
arXiv Detail & Related papers (2022-12-07T10:25:59Z)
- Learning to Reconstruct Missing Data from Spatiotemporal Graphs with Sparse Observations [11.486068333583216]
This paper tackles the problem of learning effective models to reconstruct missing data points.
We propose a class of attention-based architectures that, given a set of highly sparse observations, learn a representation for points in time and space.
Compared to the state of the art, our model handles sparse data without propagating prediction errors or requiring a bidirectional model to encode forward and backward time dependencies.
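
A hedged sketch of the general mechanism (not the paper's exact architecture): points with missing values cross-attend to the encoded sparse observations, yielding one representation per query point:

```python
# Hedged sketch: query points in time/space cross-attend to a set of
# sparse observations to produce representations for reconstruction.
import torch
import torch.nn as nn

D = 32
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

obs = torch.randn(1, 20, D)      # 20 sparse observed points (encoded)
queries = torch.randn(1, 50, D)  # 50 points whose values are missing

recon_repr, weights = attn(query=queries, key=obs, value=obs)
print(recon_repr.shape)          # (1, 50, D): one vector per missing point
```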
arXiv Detail & Related papers (2022-05-26T16:40:48Z)
- Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models.
We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
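
The paper's quantification method is more involved, but a simple illustrative probe in the same spirit measures how much a model's output changes when frame order is shuffled (the function and toy model below are hypothetical):

```python
# Illustrative probe: measure how much a video model's prediction
# changes when the frame order is shuffled. A model that ignores
# temporal structure barely changes its output.
import torch

def temporal_sensitivity(model, clip: torch.Tensor) -> float:
    # clip: (batch, time, chan, H, W)
    with torch.no_grad():
        ordered = model(clip)
        perm = torch.randperm(clip.shape[1])
        shuffled = model(clip[:, perm])
    return (ordered - shuffled).abs().mean().item()

# Toy stand-in model that averages over time (hence order-invariant):
model = lambda x: x.mean(dim=1).flatten(1)
clip = torch.randn(2, 8, 3, 16, 16)
print(temporal_sensitivity(model, clip))  # 0 for an order-invariant model
```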
arXiv Detail & Related papers (2022-04-25T19:06:48Z)
- Object-centric and memory-guided normality reconstruction for video anomaly detection [56.64792194894702]
This paper addresses the anomaly detection problem for video surveillance.
Due to the inherent rarity and heterogeneity of abnormal events, the problem is approached through a normality modeling strategy.
Our model learns object-centric normal patterns without seeing anomalous samples during training.
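
As a hedged sketch of the normality-modeling recipe this line of work builds on (a generic reconstruction baseline, not the paper's object-centric model): train an autoencoder on normal data only and score test samples by reconstruction error:

```python
# Generic normality-modeling sketch: a toy autoencoder trained on
# normal data; at test time, high reconstruction error = anomalous.
import torch
import torch.nn as nn

ae = nn.Sequential(              # toy autoencoder over object crops
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 128), nn.ReLU(),
    nn.Linear(128, 3 * 32 * 32),
)

def anomaly_score(crop: torch.Tensor) -> torch.Tensor:
    # crop: (batch, 3, 32, 32); higher score = more anomalous
    recon = ae(crop).view_as(crop)
    return (recon - crop).pow(2).mean(dim=(1, 2, 3))

print(anomaly_score(torch.randn(4, 3, 32, 32)).shape)  # (4,)
```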
arXiv Detail & Related papers (2022-03-07T19:28:39Z)
- Deep Generative model with Hierarchical Latent Factors for Time Series Anomaly Detection [40.21502451136054]
This work presents DGHL, a new family of generative models for time series anomaly detection.
A top-down Convolution Network maps a novel hierarchical latent space to time series windows, exploiting temporal dynamics to encode information efficiently.
Our method outperformed current state-of-the-art models on four popular benchmark datasets.
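
DGHL's latent hierarchy and training procedure are more elaborate, but the shape of a top-down convolutional decoder mapping a latent vector to a time-series window can be sketched as follows (all sizes are illustrative assumptions):

```python
# Hedged sketch of a top-down decoder: transposed 1D convolutions
# expand a latent vector into a full time-series window.
import torch
import torch.nn as nn

latent_dim, channels, window = 16, 1, 64
decoder = nn.Sequential(
    nn.Unflatten(1, (latent_dim, 1)),
    nn.ConvTranspose1d(latent_dim, 32, kernel_size=4, stride=4),  # 1 -> 4
    nn.ReLU(),
    nn.ConvTranspose1d(32, 32, kernel_size=4, stride=4),          # 4 -> 16
    nn.ReLU(),
    nn.ConvTranspose1d(32, channels, kernel_size=4, stride=4),    # 16 -> 64
)

z = torch.randn(8, latent_dim)
print(decoder(z).shape)  # (8, 1, 64): one reconstructed window per latent
```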
arXiv Detail & Related papers (2022-02-15T17:19:44Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in both training and inference.
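
STAR's segmented variant is not reproduced here, but the generic linear-attention trick such models build on can be sketched: a positive feature map applied to queries and keys lets attention be computed with cost linear, rather than quadratic, in sequence length:

```python
# Linear attention in a nutshell (illustrative, not STAR's segmented
# variant): phi(x) = elu(x) + 1 keeps features positive, so key/value
# statistics can be summarized once and reused for every query.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (batch, time, dim)
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("btd,bte->bde", k, v)      # summarize keys/values once
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + eps)
    return torch.einsum("btd,bde,bt->bte", q, kv, z)

q = k = v = torch.randn(2, 100, 32)
print(linear_attention(q, k, v).shape)  # (2, 100, 32)
```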
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Anomaly Detection of Time Series with Smoothness-Inducing Sequential Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
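
Two SISVAE ingredients can be sketched in a hedged way (the full model is a sequential VAE; the networks and loss weighting below are assumptions): a per-time-stamp Gaussian parameterization and a smoothness-inducing penalty on the predicted means:

```python
# Hedged sketch: a recurrent network emits a mean and log-variance per
# time-stamp, and a smoothness penalty discourages abrupt mean jumps.
import torch
import torch.nn as nn

D, H = 8, 32
rnn = nn.GRU(input_size=D, hidden_size=H, batch_first=True)
head = nn.Linear(H, 2 * D)  # per-step mean and log-variance

x = torch.randn(4, 50, D)
h, _ = rnn(x)
mu, logvar = head(h).chunk(2, dim=-1)            # each (4, 50, D)

nll = 0.5 * (logvar + (x - mu) ** 2 / logvar.exp()).mean()
smooth = (mu[:, 1:] - mu[:, :-1]).pow(2).mean()  # smoothness-inducing term
loss = nll + 0.1 * smooth                        # assumed weighting
print(float(loss))
```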
arXiv Detail & Related papers (2021-02-02T06:15:15Z)
- Unsupervised Video Decomposition using Spatio-temporal Iterative Inference [31.97227651679233]
Multi-object scene decomposition is a fast-emerging problem in representation learning.
We show that our model has a high accuracy even without color information.
We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state of the art on several benchmark datasets.
arXiv Detail & Related papers (2020-06-25T22:57:17Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
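
The tensor-train factorization is the paper's contribution and is not reproduced here; as a hedged baseline sketch, this is the plain convolutional LSTM cell such models extend, with the LSTM gates computed by convolutions instead of matrix multiplies:

```python
# Plain ConvLSTM cell (baseline sketch, not the tensor-train variant):
# all four LSTM gates come from one convolution over [input, hidden].
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g                 # cell state carries long-term info
        return o * c.tanh(), c

cell = ConvLSTMCell(in_ch=1, hid_ch=8)
h = c = torch.zeros(2, 8, 16, 16)
for x in torch.randn(5, 2, 1, 16, 16):    # iterate over 5 frames
    h, c = cell(x, h, c)
print(h.shape)
```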
arXiv Detail & Related papers (2020-02-21T05:00:01Z)