Spatio-Temporal Context for Action Detection
- URL: http://arxiv.org/abs/2106.15171v1
- Date: Tue, 29 Jun 2021 08:33:48 GMT
- Title: Spatio-Temporal Context for Action Detection
- Authors: Manuel Sarmiento Calderó, David Varas, Elisenda Bou-Balust
- Abstract summary: This work proposes to use non-aggregated temporal information.
The main contribution is the introduction of two cross attention blocks.
Experiments on the AVA dataset show the advantages of the proposed approach.
- Score: 2.294635424666456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research in action detection has grown in recent years, as it plays a key
role in video understanding. Modelling the interactions (either spatial or
temporal) between actors and their context has proven to be essential for this
task. While recent works use spatial features with aggregated temporal
information, this work proposes to use non-aggregated temporal information.
This is done by adding an attention-based method that leverages spatio-temporal
interactions between elements in the scene along the clip. The main contribution
of this work is the introduction of two cross attention blocks that effectively
model the spatial relations and capture short-range temporal interactions.
Experiments on the AVA dataset show the advantages of the proposed approach,
which models spatio-temporal relations between relevant elements in the
scene, outperforming other methods that model actor interactions with their
context by +0.31 mAP.
Related papers
- Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z) - On the Importance of Spatial Relations for Few-shot Action Recognition [109.2312001355221]
In this paper, we investigate the importance of spatial relations and propose a more accurate few-shot action recognition method.
A novel Spatial Alignment Cross Transformer (SA-CT) learns to re-adjust the spatial relations and incorporates the temporal information.
Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to temporal-based methods on 3/4 benchmarks.
arXiv Detail & Related papers (2023-08-14T12:58:02Z) - Extracting Fast and Slow: User-Action Embedding with Inter-temporal
Information [8.697025191437774]
We propose a method that analyzes user actions with inter-temporal information (time intervals).
We embed the user's action sequence and its time intervals to obtain a low-dimensional representation of the action along with inter-temporal information.
This paper demonstrates that explicit modeling of action sequences and inter-temporal user behavior information enables successful interpretable analysis (a toy sketch of this idea follows the list below).
arXiv Detail & Related papers (2022-06-20T02:04:04Z) - Effective Actor-centric Human-object Interaction Detection [20.564689533862524]
We propose a novel actor-centric framework to detect Human-Object Interaction in images.
Our method achieves the state-of-the-art on the challenging V-COCO and HICO-DET benchmarks.
arXiv Detail & Related papers (2022-02-24T10:24:44Z) - Spatio-Temporal Interaction Graph Parsing Networks for Human-Object
Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z) - Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features, as well as spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z) - A Spatial-Temporal Attentive Network with Spatial Continuity for
Trajectory Prediction [74.00750936752418]
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC).
First, a spatial-temporal attention mechanism is presented to explore the most useful and important information.
Second, we construct a joint feature sequence based on sequence and instant state information to make the generated trajectories keep spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)