Efficient Spatialtemporal Context Modeling for Action Recognition
- URL: http://arxiv.org/abs/2103.11190v1
- Date: Sat, 20 Mar 2021 14:48:12 GMT
- Title: Efficient Spatialtemporal Context Modeling for Action Recognition
- Authors: Congqi Cao, Yue Lu, Yifan Zhang, Dongmei Jiang and Yanning Zhang
- Abstract summary: We propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range spatiotemporal contextual information in video for action recognition.
At each time step, we model the relationships between points lying on the same line along the horizontal, vertical and depth directions, which forms a 3D criss-cross structure.
Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs by 25% and 11%, respectively, for video context modeling.
- Score: 42.30158166919919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual information plays an important role in action recognition. Local
operations have difficulty modeling the relation between two elements separated
by a long distance. However, directly modeling the contextual information
between any two points incurs a huge cost in computation and memory, especially
for action recognition, where there is an additional temporal dimension.
Inspired by the 2D criss-cross attention used in segmentation tasks, we propose a
recurrent 3D criss-cross attention (RCCA-3D) module to model the dense
long-range spatiotemporal contextual information in video for action
recognition. The global context is factorized into sparse relation maps. We
model the relationships between points lying on the same line along the
horizontal, vertical and depth directions at each time step, which forms a 3D
criss-cross structure, and repeat the same operation with a recurrent mechanism
to propagate the relations from a line to a plane and finally to the whole
spatiotemporal space. Compared with the non-local method, the proposed RCCA-3D
module reduces the number of parameters and FLOPs by 25% and 11%, respectively,
for video context modeling. We evaluate the performance of RCCA-3D with two
recent action recognition networks on three datasets and make a thorough analysis of the
architecture, obtaining the best way to factorize and fuse the relation maps.
Comparisons with other state-of-the-art methods demonstrate the effectiveness
and efficiency of our model.
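To make the factorized attention concrete, the following is a minimal PyTorch-style sketch of one 3D criss-cross attention step and its recurrent wrapper. It illustrates the idea described in the abstract and is not the authors' implementation: the class names, the channel-reduction ratio, the learnable residual scale, the three independent per-axis softmaxes (rather than a single softmax over the joint criss-cross energies) and the default number of recurrences are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrissCrossAttention3D(nn.Module):
    """Each position (t, h, w) attends only to positions on its three axis-aligned
    lines (same row, same column, same temporal line), i.e. a 3D criss-cross set."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv3d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv3d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual scale

    def _axis_attention(self, q, k, v, dim):
        # Move the chosen axis (2=time, 3=height, 4=width) to the end and run
        # attention independently along every line parallel to that axis.
        q, k, v = (t.movedim(dim, -1) for t in (q, k, v))
        energy = torch.einsum('bc...i,bc...j->b...ij', q, k)
        attn = F.softmax(energy, dim=-1)
        out = torch.einsum('b...ij,bc...j->bc...i', attn, v)
        return out.movedim(-1, dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Sum the three sparse relation maps (time, height and width lines).
        out = sum(self._axis_attention(q, k, v, dim) for dim in (2, 3, 4))
        return self.gamma * out + x


class RCCA3D(nn.Module):
    """Recurrent wrapper: each pass lets information cross one more coordinate,
    propagating relations from a line to a plane and then to the whole volume."""

    def __init__(self, channels, recurrence=3):
        super().__init__()
        self.cca = CrissCrossAttention3D(channels)
        self.recurrence = recurrence

    def forward(self, x):
        for _ in range(self.recurrence):
            x = self.cca(x)
        return x


x = torch.randn(1, 64, 4, 14, 14)  # (batch, channels, frames, height, width)
print(RCCA3D(64)(x).shape)         # torch.Size([1, 64, 4, 14, 14])
```

Each pass lets information travel along one additional coordinate axis, so relations propagate from a line to a plane and then to the whole T x H x W volume, which is how the sparse relation maps can stand in for dense non-local attention at lower cost.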
Related papers
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- LocATe: End-to-end Localization of Actions in 3D with Transformers [91.28982770522329]
LocATe is an end-to-end approach that jointly localizes and recognizes actions in a 3D sequence.
Unlike transformer-based object-detection and classification models which consider image or patch features as input, LocATe's transformer model is capable of capturing long-term correlations between actions in a sequence.
We introduce a new, challenging, and more realistic benchmark dataset, BABEL-TAL-20 (BT20), where the performance of state-of-the-art methods is significantly worse.
arXiv Detail & Related papers (2022-03-21T03:35:32Z)
- Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework, CTL, to pursue a discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets; a generic sketch of such decoupled space-time attention is given after this list.
arXiv Detail & Related papers (2020-12-15T18:58:21Z)
- Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification [12.787763599624173]
We propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D; an illustrative sketch of this kind of depthwise factorization also follows this list.
Thanks to the efficiency and effectiveness of its temporal modeling, VoV3D-L has 6x fewer model parameters and 16x less computation, while surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics.
arXiv Detail & Related papers (2020-12-01T07:40:06Z)
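The decoupling mentioned in the GTA entry above can be illustrated generically: spatial self-attention within each frame followed by temporal self-attention across frames at each spatial location. The PyTorch sketch below shows only this decoupled structure; it is not GTA's actual design (in particular, its global temporal attention is not reproduced), and the module name, head count and residual connections are assumptions.

```python
import torch
import torch.nn as nn


class DecoupledSpaceTimeAttention(nn.Module):
    """Spatial self-attention within each frame, then temporal self-attention
    across frames at every spatial location, applied as two decoupled steps."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C)
        b, t, h, w, c = x.shape
        # 1) Spatial step: tokens attend only within their own frame.
        s = x.reshape(b * t, h * w, c)
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(b, t, h, w, c)
        # 2) Temporal step: each spatial location attends across all frames.
        m = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
        m, _ = self.temporal(m, m, m)
        x = x + m.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)
        return x


x = torch.randn(2, 8, 14, 14, 64)  # (batch, frames, height, width, channels)
print(DecoupledSpaceTimeAttention(64)(x).shape)  # torch.Size([2, 8, 14, 14, 64])
```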
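Likewise, the depthwise factorized component mentioned in the VoV3D entry can be sketched as a depthwise spatial convolution followed by a depthwise temporal convolution. The code below is illustrative only and is not taken from the VoV3D implementation; the kernel sizes, normalization and activation are assumptions.

```python
import torch
import torch.nn as nn


class DepthwiseSpatioTemporalConv(nn.Module):
    """Splits a depthwise 3x3x3 convolution into a depthwise 1x3x3 spatial
    convolution followed by a depthwise 3x1x1 temporal convolution, which
    reduces parameters and FLOPs relative to the full 3D depthwise kernel."""

    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), groups=channels, bias=False)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), groups=channels, bias=False)
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.act(self.bn(self.temporal(self.spatial(x))))


x = torch.randn(2, 64, 8, 56, 56)  # an 8-frame clip of 56x56 feature maps
print(DepthwiseSpatioTemporalConv(64)(x).shape)  # torch.Size([2, 64, 8, 56, 56])
```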
This list is automatically generated from the titles and abstracts of the papers in this site.