Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph
Generation
- URL: http://arxiv.org/abs/2309.13237v3
- Date: Fri, 15 Dec 2023 08:42:04 GMT
- Title: Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph
Generation
- Authors: Tao Pu, Tianshui Chen, Hefeng Wu, Yongyi Lu, Liang Lin
- Abstract summary: Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism.
- Score: 64.85974098314344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video scene graph generation (VidSGG) aims to identify objects in visual
scenes and infer their relationships for a given video. It requires not only a
comprehensive understanding of each object scattered across the whole scene but
also a deep dive into their temporal motions and interactions. Inherently,
object pairs and their relationships enjoy spatial co-occurrence correlations
within each image and temporal consistency/transition correlations across
different images, which can serve as prior knowledge to facilitate VidSGG model
learning and inference. In this work, we propose a spatial-temporal
knowledge-embedded transformer (STKET) that incorporates the prior
spatial-temporal knowledge into the multi-head cross-attention mechanism to
learn more representative relationship representations. Specifically, we first
learn spatial co-occurrence and temporal transition correlations in a
statistical manner. Then, we design spatial and temporal knowledge-embedded
layers that introduce the multi-head cross-attention mechanism to fully explore
the interaction between visual representation and the knowledge to generate
spatial- and temporal-embedded representations, respectively. Finally, we
aggregate these representations for each subject-object pair to predict the
final semantic labels and their relationships. Extensive experiments show that
STKET outperforms current competing algorithms by a large margin, e.g.,
improving mR@50 by 8.1%, 4.7%, and 2.1% under different settings over existing
algorithms.
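To make the two steps in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of (1) estimating a spatial co-occurrence prior from training annotations and (2) injecting that prior into a multi-head cross-attention layer as knowledge tokens. All names, shapes, and design choices here are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch (not the authors' released code) of the two steps the
# abstract describes: (1) estimating spatial co-occurrence statistics from
# training annotations and (2) injecting them into multi-head cross-attention.
import torch
import torch.nn as nn


def cooccurrence_prior(triplets, num_obj_classes, num_predicates):
    """Estimate P(predicate | subject class, object class) by counting.

    `triplets` is an iterable of (subject_class, object_class, predicate) ids
    taken from the training annotations.
    """
    counts = torch.zeros(num_obj_classes, num_obj_classes, num_predicates)
    for s, o, p in triplets:
        counts[s, o, p] += 1.0
    return counts / counts.sum(dim=-1, keepdim=True).clamp(min=1.0)


class KnowledgeEmbeddedCrossAttention(nn.Module):
    """Cross-attention where visual relationship features act as queries and
    prior-weighted predicate embeddings act as keys/values ("knowledge")."""

    def __init__(self, dim, num_predicates, num_heads=8):
        super().__init__()
        self.pred_embed = nn.Embedding(num_predicates, dim)  # one token per predicate
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rel_feats, prior_probs):
        # rel_feats:   (B, N, dim)  visual features of N subject-object pairs
        # prior_probs: (B, N, R)    statistical prior over R predicates per pair
        B, N, dim = rel_feats.shape
        # Weight every predicate embedding by the pair's prior probability.
        knowledge = prior_probs.unsqueeze(-1) * self.pred_embed.weight  # (B, N, R, dim)
        q = rel_feats.reshape(B * N, 1, dim)        # each pair queries its own knowledge
        kv = knowledge.reshape(B * N, -1, dim)
        attended, _ = self.attn(q, kv, kv)          # (B*N, 1, dim)
        attended = attended.reshape(B, N, dim)
        return self.norm(rel_feats + attended)      # residual connection + layer norm
```

An analogous temporal knowledge-embedded layer would swap the co-occurrence prior for frame-to-frame transition statistics; the per-pair aggregation and final classifiers mentioned in the abstract are omitted from this sketch.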
Related papers
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework can attain better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Learning Appearance-motion Normality for Video Anomaly Detection [11.658792932975652]
We propose a spatial-temporal memories-augmented two-stream auto-encoder framework.
It learns appearance normality and motion normality independently and explores their correlations via adversarial learning.
Our framework outperforms state-of-the-art methods, achieving AUCs of 98.1% and 89.8% on the UCSD Ped2 and CUHK Avenue datasets, respectively.
arXiv Detail & Related papers (2022-07-27T08:30:19Z)
- Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Full use of appearance features, spatial location, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed to exploit the spatial and temporal long-range context interdependencies of such features and the spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)