GTA: Global Temporal Attention for Video Action Understanding
- URL: http://arxiv.org/abs/2012.08510v2
- Date: Thu, 8 Apr 2021 18:16:52 GMT
- Title: GTA: Global Temporal Attention for Video Action Understanding
- Authors: Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav
Shrivastava
- Abstract summary: We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Experiments on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and achieves state-of-the-art performance on three video action recognition datasets.
- Score: 51.476605514802806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-attention learns pairwise interactions to model long-range dependencies,
yielding great improvements for video action recognition. In this paper, we
seek a deeper understanding of self-attention for temporal modeling in videos.
We first demonstrate that the entangled modeling of spatio-temporal information
by flattening all pixels is sub-optimal, failing to capture temporal
relationships among frames explicitly. To this end, we introduce Global
Temporal Attention (GTA), which performs global temporal attention on top of
spatial attention in a decoupled manner. We apply GTA on both pixels and
semantically similar regions to capture temporal relationships at different
levels of spatial granularity. Unlike conventional self-attention that computes
an instance-specific attention matrix, GTA directly learns a global attention
matrix that is intended to encode temporal structures that generalize across
different samples. We further augment GTA with a cross-channel multi-head
fashion to exploit channel interactions for better temporal modeling. Extensive
experiments on 2D and 3D networks demonstrate that our approach consistently
enhances temporal modeling and provides state-of-the-art performance on three
video action recognition datasets.
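To make the decoupled design concrete, below is a minimal PyTorch sketch of a GTA-style block under stated assumptions: the temporal attention matrix is a learned parameter shared across all samples (rather than computed per instance from queries and keys), and heads are formed by splitting the channel dimension, following the cross-channel multi-head idea. All module, argument, and shape choices are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    """Minimal sketch of decoupled global temporal attention (GTA).

    Unlike conventional self-attention, the T x T temporal attention
    matrix here is a learned parameter shared by all samples, so it can
    encode temporal structure that generalizes across clips. Heads are
    formed by splitting the channel dimension (cross-channel multi-head).
    Names and shapes are illustrative, not the authors' code.
    """

    def __init__(self, channels: int, num_frames: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        # One global T x T attention matrix per head, learned directly.
        self.attn = nn.Parameter(torch.zeros(num_heads, num_frames, num_frames))
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- batch, frames, spatial positions, channels.
        b, t, n, c = x.shape
        h = self.num_heads
        # Cross-channel multi-head: split C into H groups of C/H channels.
        xh = x.view(b, t, n, h, c // h).permute(0, 3, 1, 2, 4).reshape(b, h, t, -1)
        # Normalize each row of the global attention over source frames.
        attn = self.attn.softmax(dim=-1)                # (H, T, T)
        out = torch.einsum("hts,bhsd->bhtd", attn, xh)  # mix across frames
        # Merge heads back to (B, T, N, C) and project.
        out = out.view(b, h, t, n, c // h).permute(0, 2, 3, 1, 4).reshape(b, t, n, c)
        return x + self.proj(out)                       # residual connection
```

In this reading, spatial attention runs first within each frame, after which a block like this mixes information across frames; because the T x T weights are parameters rather than per-clip query-key affinities, they encode temporal structure shared across the dataset. For example, with `channels=256` and `num_frames=8`, an input of shape `(2, 8, 49, 256)` (batch 2, 8 frames, a 7x7 spatial grid) maps to the same shape.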
Related papers
- Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation [64.85974098314344]
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates this prior spatial-temporal knowledge into the multi-head cross-attention mechanism; a generic sketch of this idea follows this entry.
arXiv Detail & Related papers (2023-09-23T02:40:28Z)
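For the knowledge-embedding idea above, here is one generic, speculative way prior relation knowledge could be injected into multi-head cross-attention: as a learned per-head bias on the attention logits, indexed by prior relation type. All names, shapes, and the bias mechanism itself are assumptions for illustration, not the STKET implementation.

```python
import torch
import torch.nn as nn

class KnowledgeBiasedCrossAttention(nn.Module):
    """Generic cross-attention with a learned prior-knowledge bias.

    A speculative illustration of embedding prior knowledge into
    multi-head cross-attention: a bias derived from knowledge embeddings
    is added to the query-key logits. Not the STKET authors' code.
    """

    def __init__(self, dim: int, num_heads: int, num_relations: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # One learned bias per (relation type, head): the "prior knowledge".
        self.knowledge_bias = nn.Embedding(num_relations, num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, context, relation_ids):
        # queries: (B, Nq, C); context: (B, Nk, C)
        # relation_ids: (B, Nq, Nk) integer ids of prior relation types.
        b, nq, c = queries.shape
        nk = context.shape[1]
        h, d = self.num_heads, c // self.num_heads
        q = self.q(queries).view(b, nq, h, d).transpose(1, 2)
        k, v = self.kv(context).view(b, nk, 2, h, d).permute(2, 0, 3, 1, 4)
        logits = (q @ k.transpose(-2, -1)) * self.scale   # (B, H, Nq, Nk)
        bias = self.knowledge_bias(relation_ids).permute(0, 3, 1, 2)
        attn = (logits + bias).softmax(dim=-1)            # knowledge-shifted
        out = (attn @ v).transpose(1, 2).reshape(b, nq, c)
        return self.proj(out)
```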
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification [49.06447472006251]
We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T21:15:58Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework, Correlation and Topology Learning (CTL), to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlations.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints in the human skeleton by modeling local and global spatial information via attention mechanisms; a hedged sketch of this local/global idea follows this entry.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper-body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
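As a generic, hedged illustration of the local/global attention idea mentioned above, the sketch below attends over skeleton joints twice: once masked to physically connected joints (local) and once over all joints (global). The class name, the adjacency input, and the additive fusion are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LocalGlobalJointAttention(nn.Module):
    """Attention over skeleton joints at two spatial scopes.

    A speculative sketch: one attention pass is masked to physically
    connected joints (local), another attends over all joints (global),
    and the two are summed. Not the paper's actual architecture.
    """

    def __init__(self, dim: int, adjacency: torch.Tensor):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        # Boolean (J, J) mask of bone connections, including self-loops
        # so every attention row has at least one valid target.
        self.register_buffer("adj", adjacency.bool())

    def attend(self, q, k, v, mask=None):
        logits = (q @ k.transpose(-2, -1)) * self.scale
        if mask is not None:
            logits = logits.masked_fill(~mask, float("-inf"))
        return logits.softmax(dim=-1) @ v

    def forward(self, x):
        # x: (B, J, C) -- batch, joints, channels.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        local = self.attend(q, k, v, self.adj)  # constrained to bones
        global_ = self.attend(q, k, v)          # all joint pairs
        return x + local + global_
```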