Revisiting spatio-temporal layouts for compositional action recognition
- URL: http://arxiv.org/abs/2111.01936v1
- Date: Tue, 2 Nov 2021 23:04:39 GMT
- Title: Revisiting spatio-temporal layouts for compositional action recognition
- Authors: Gorjan Radevski, Marie-Francine Moens, Tinne Tuytelaars
- Abstract summary: We take an object-centric approach to action recognition.
The main focus of this paper is compositional/few-shot action recognition.
We demonstrate how to improve the performance of appearance-based models by fusion with layout-based models.
- Score: 63.04778884595353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing human actions is fundamentally a spatio-temporal reasoning
problem, and should be, at least to some extent, invariant to the appearance of
the human and the objects involved. Motivated by this hypothesis, in this work,
we take an object-centric approach to action recognition. Multiple works have
studied this setting before, yet it remains unclear (i) how well a carefully
crafted, spatio-temporal layout-based method can recognize human actions, and
(ii) how, and when, to fuse the information from layout and appearance-based
models. The main focus of this paper is compositional/few-shot action
recognition, where we advocate the usage of multi-head attention (proven to be
effective for spatial reasoning) over spatio-temporal layouts, i.e.,
configurations of object bounding boxes. We evaluate different schemes to
inject video appearance information into the system, and benchmark our approach
on background cluttered action recognition. On the Something-Else and Action
Genome datasets, we demonstrate (i) how to extend multi-head attention for
spatio-temporal layout-based action recognition, (ii) how to improve the
performance of appearance-based models by fusion with layout-based models,
(iii) that even on non-compositional background-cluttered video datasets, a
fusion between layout- and appearance-based models improves the performance.
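As a rough illustration of two ideas named in the abstract (a layout as a set of object bounding boxes over time, and fusing a layout-based model with an appearance-based one), here is a minimal sketch. It is not the authors' implementation: the function names, the token format, the example boxes and logits, and the choice of late fusion by a weighted average of class probabilities are all assumptions made for illustration. The multi-head attention model that would consume the layout tokens is omitted.

```python
import math


def normalize_box(box, frame_w, frame_h):
    """Scale an (x1, y1, x2, y2) pixel box to [0, 1] so the layout does not depend on resolution."""
    x1, y1, x2, y2 = box
    return (x1 / frame_w, y1 / frame_h, x2 / frame_w, y2 / frame_h)


def layout_tokens(frames, frame_w, frame_h):
    """Flatten per-frame detections into (frame_index, category_id, normalized_box) tokens.

    `frames` is a list (one entry per frame) of lists of (category_id, box) pairs.
    This token format is a hypothetical choice, not one taken from the paper.
    """
    tokens = []
    for t, detections in enumerate(frames):
        for category_id, box in detections:
            tokens.append((t, category_id, normalize_box(box, frame_w, frame_h)))
    return tokens


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(layout_logits, appearance_logits, layout_weight=0.5):
    """Weighted average of the class probabilities of the two models (assumed fusion scheme)."""
    p_layout = softmax(layout_logits)
    p_appearance = softmax(appearance_logits)
    return [
        layout_weight * a + (1.0 - layout_weight) * b
        for a, b in zip(p_layout, p_appearance)
    ]


# Two frames, each with a hand (category 0) and an object (category 1); values are made up.
frames = [
    [(0, (32, 48, 96, 144)), (1, (120, 60, 200, 180))],
    [(0, (40, 48, 104, 144)), (1, (118, 60, 198, 180))],
]
tokens = layout_tokens(frames, frame_w=320, frame_h=240)

# Made-up per-class logits standing in for the outputs of the two models.
fused = late_fusion([2.0, 0.5, -1.0], [0.1, 1.5, 0.2])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Averaging probabilities is only one of several possible fusion points; the abstract states that the paper evaluates different schemes for injecting appearance information, which this sketch does not attempt to reproduce.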
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
- DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem.
We show two unique characteristics of DiffPose on the pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z)
- Spatio-Temporal Relation Learning for Video Anomaly Detection [35.59510027883497]
Anomaly identification is highly dependent on the relationship between the object and the scene.
In this paper, we propose a Spatial-Temporal Relation Learning framework to tackle the video anomaly detection task.
Experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2022-09-27T02:19:31Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Object-centric and memory-guided normality reconstruction for video anomaly detection [56.64792194894702]
This paper addresses the anomaly detection problem for video surveillance.
Due to the inherent rarity and heterogeneity of abnormal events, the problem is viewed as a normality modeling strategy.
Our model learns object-centric normal patterns without seeing anomalous samples during training.
arXiv Detail & Related papers (2022-03-07T19:28:39Z)
- Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves performance that is the best among, or competitive with, state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition [4.143091738981101]
We show that an object-centric hidden representation in an encoder-decoder model with iterative glimpse attention yields effective integration of attention and recognition.
Our work takes a step toward a general architecture for how to integrate recurrent object-centric representation into the planning of attentional glimpses.
arXiv Detail & Related papers (2021-10-11T01:41:21Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.