Spatial-Temporal Transformer for Dynamic Scene Graph Generation
- URL: http://arxiv.org/abs/2107.12309v1
- Date: Mon, 26 Jul 2021 16:30:30 GMT
- Title: Spatial-Temporal Transformer for Dynamic Scene Graph Generation
- Authors: Yuren Cong, Wentong Liao, Hanno Ackermann, Michael Ying Yang, Bodo
Rosenhahn
- Abstract summary: We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG).
- Score: 34.190733855032065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamic scene graph generation aims at generating a scene graph of the given
video. Compared to the task of scene graph generation from images, it is more
challenging because of the dynamic relationships between objects and the
temporal dependencies between frames, which allow for a richer semantic
interpretation. In this paper, we propose Spatial-temporal Transformer
(STTran), a neural network that consists of two core modules: (1) a spatial
encoder that takes an input frame to extract spatial context and reason about
the visual relationships within a frame, and (2) a temporal decoder which takes
the output of the spatial encoder as input in order to capture the temporal
dependencies between frames and infer the dynamic relationships. Furthermore,
STTran can flexibly take videos of varying lengths as input without clipping,
which is especially important for long videos. Our method is validated on the
benchmark dataset Action Genome (AG). The experimental results demonstrate the
superior performance of our method for dynamic scene graph generation.
Moreover, a set of ablation studies is conducted to justify the effect of each
proposed module.
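As a rough illustration of the two-module design described above, the sketch below wires a spatial transformer encoder (self-attention over the relationship representations of one frame) in front of a temporal transformer decoder (attention across frames). It is a minimal PyTorch-style sketch under assumed dimensions and class counts, not the authors' released implementation.
```python
# Illustrative sketch of a spatial-encoder / temporal-decoder stack as described
# in the abstract. All dimensions and names are assumptions for demonstration.
import torch
import torch.nn as nn

class SpatialTemporalSketch(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        # Spatial encoder: self-attention over the relationship features of one frame.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Temporal decoder: attends across frames to capture temporal dependencies.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.temporal_decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.predicate_head = nn.Linear(d_model, 26)  # assumed number of AG predicate classes

    def forward(self, rel_feats):
        # rel_feats: (num_frames, num_relation_pairs, d_model) for one video
        spatial = self.spatial_encoder(rel_feats)      # reason within each frame
        # Treat each relation pair as a batch element and the frames as a sequence,
        # so attention runs along the temporal axis.
        temporal_in = spatial.transpose(0, 1)          # (pairs, frames, d_model)
        temporal = self.temporal_decoder(temporal_in, temporal_in)
        return self.predicate_head(temporal)           # per-pair, per-frame predicate logits

# Example: a 10-frame video with 5 candidate subject-object pairs per frame.
model = SpatialTemporalSketch()
logits = model(torch.randn(10, 5, 512))
print(logits.shape)  # torch.Size([5, 10, 26])
```
Because the temporal attention runs over the frame axis, the same module accepts videos of any length without clipping, which is the flexibility the abstract highlights for long videos.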
Related papers
- CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos [9.807247838436489]
We introduce the new AeroEye dataset that focuses on multi-object relationship modeling in aerial videos.
We propose the novel Cyclic Graph Transformer (CYCLO) approach that allows the model to capture both direct and long-range temporal dependencies.
The proposed approach also allows one to handle sequences with inherent cyclical patterns and process object relationships in the correct sequential order.
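As a loose illustration of the cyclic idea only (not CYCLO's actual architecture), the toy sketch below restricts temporal attention to a circular neighbourhood of frames via a circulant mask; the window size and dimensions are assumptions.
```python
# Hypothetical illustration: attention over frames arranged in a cycle, so each
# frame attends to neighbours in both directions with wrap-around. This is an
# assumption about the general idea, not CYCLO's implementation.
import torch
import torch.nn as nn

def cyclic_attention_mask(num_frames: int, window: int = 2) -> torch.Tensor:
    """Boolean mask where True marks positions that are NOT attended to."""
    idx = torch.arange(num_frames)
    dist = (idx[None, :] - idx[:, None]).abs()
    dist = torch.minimum(dist, num_frames - dist)   # circular distance between frames
    return dist > window                            # block attention outside the cyclic window

frames = torch.randn(1, 8, 256)                     # (batch, frames, feature dim)
attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
mask = cyclic_attention_mask(num_frames=8, window=2)
out, _ = attn(frames, frames, frames, attn_mask=mask)
print(out.shape)  # torch.Size([1, 8, 256])
```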
arXiv Detail & Related papers (2024-06-03T06:24:55Z)
- Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces the local interaction information and global human-action interaction information.
Long-temporal human actions supervise the model to generate multiple scene graphs that conform to the global constraints, so that the model does not fail to learn the tail predicates.
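A schematic of such a multi-task objective is sketched below: a per-pair predicate loss (local) combined with a video-level action loss acting as a global constraint. The heads, label shapes, and loss weighting are hypothetical.
```python
# Hypothetical multi-task objective: per-frame predicate classification plus a
# video-level human-action loss as a global constraint. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

predicate_head = nn.Linear(512, 26)      # local task: relationship predicates (assumed count)
action_head = nn.Linear(512, 157)        # global task: long-temporal human actions (assumed count)

rel_feats = torch.randn(40, 512)                         # pooled subject-object pair features
video_feat = torch.randn(1, 512)                         # pooled video-level feature
predicate_labels = torch.randint(0, 26, (40,))
action_labels = torch.randint(0, 2, (1, 157)).float()    # multi-label actions

loss_sgg = F.cross_entropy(predicate_head(rel_feats), predicate_labels)
loss_action = F.binary_cross_entropy_with_logits(action_head(video_feat), action_labels)
total_loss = loss_sgg + 0.5 * loss_action                # weighting is an arbitrary choice here
print(float(total_loss))
```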
arXiv Detail & Related papers (2023-08-10T01:24:25Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
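The sketch below shows generic spatial- and channel-attention blocks of the kind the summary mentions (a common CBAM-style pattern); it illustrates the two attention types rather than SCTNet's actual layers.
```python
# Generic channel- and spatial-attention blocks, shown only to illustrate the
# two attention types named in the summary; not SCTNet's design.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.mlp(x.mean(dim=(2, 3)))         # squeeze spatial dims, weight channels
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                        # x: (B, C, H, W)
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

feat = torch.randn(2, 64, 32, 32)
out = SpatialAttention()(ChannelAttention(64)(feat))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```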
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs [16.760066844287046]
We propose a Time-variant Relation-aware TRansformer (TR$^2$) to model the temporal change of relations in dynamic scene graphs.
We show that TR$^2$ significantly outperforms previous state-of-the-art methods under two different settings.
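Purely to illustrate the notion of time-variant relation features, the toy module below augments each tracked subject-object pair's feature with its frame-to-frame change; this simplification is an assumption for demonstration, not the TR$^2$ architecture.
```python
# Toy illustration of "time-variant" relation features: augment each
# subject-object pair feature with its frame-to-frame difference.
import torch
import torch.nn as nn

class TimeVariantRelation(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, rel_seq):                  # (pairs, frames, dim), tracked pairs
        delta = rel_seq - torch.roll(rel_seq, shifts=1, dims=1)
        delta[:, 0] = 0                          # no previous frame for the first one
        return self.fuse(torch.cat([rel_seq, delta], dim=-1))

out = TimeVariantRelation()(torch.randn(5, 10, 256))
print(out.shape)  # torch.Size([5, 10, 256])
```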
arXiv Detail & Related papers (2023-05-15T10:30:38Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild [19.5702895176141]
We propose a method for capturing discriminative features within each frame and modeling the contextual relationships among frames.
We utilize the CNN to translate each frame into a visual feature sequence.
Experiments indicate that our method provides an effective way to make use of the spatial and temporal dependencies.
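A common realisation of the pipeline described here, where a CNN turns each frame into a feature vector and a transformer encoder then models the sequence, is sketched below; the backbone and the 7-class output are illustrative assumptions.
```python
# Sketch of the described pipeline: a small CNN encodes each frame into a
# feature vector, and a transformer encoder models the resulting visual
# feature sequence. Backbone and class count are assumptions.
import torch
import torch.nn as nn

cnn = nn.Sequential(                        # stand-in for a per-frame CNN backbone
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 512))
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
classifier = nn.Linear(512, 7)              # e.g. 7 basic expressions (assumed)

clip = torch.randn(16, 3, 112, 112)         # 16 frames of one video clip
feature_sequence = cnn(clip).unsqueeze(0)   # (1, 16, 512) visual feature sequence
logits = classifier(temporal(feature_sequence).mean(dim=1))
print(logits.shape)  # torch.Size([1, 7])
```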
arXiv Detail & Related papers (2022-05-10T08:47:15Z)
- Motion-aware Dynamic Graph Neural Network for Video Compressive Sensing [14.67994875448175]
Video snapshot compressive imaging (SCI) utilizes a 2D detector to capture sequential video frames and compress them into a single measurement.
Most existing reconstruction methods are incapable of efficiently capturing long-range spatial and temporal dependencies.
We propose a flexible and robust approach based on the graph neural network (GNN) to efficiently model non-local interactions between pixels in space and time regardless of the distance.
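To illustrate "non-local interactions regardless of distance", the sketch below runs one message-passing step over a k-nearest-neighbour graph built from space-time pixel features; the graph construction and sizes are assumptions, not the paper's network.
```python
# Minimal illustration of non-local message passing: connect each space-time
# pixel feature to its k nearest neighbours in feature space and aggregate.
import torch
import torch.nn as nn

def knn_graph(x: torch.Tensor, k: int) -> torch.Tensor:
    # x: (N, D) node features; returns (N, k) neighbour indices (excluding self)
    dist = torch.cdist(x, x)
    return dist.topk(k + 1, largest=False).indices[:, 1:]

class MessagePassing(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x, neighbours):
        agg = x[neighbours].mean(dim=1)          # average neighbour features
        return torch.relu(self.update(torch.cat([x, agg], dim=-1)))

nodes = torch.randn(1024, 64)                    # e.g. features of sampled space-time pixels
out = MessagePassing()(nodes, knn_graph(nodes, k=8))
print(out.shape)  # torch.Size([1024, 64])
```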
arXiv Detail & Related papers (2022-03-01T12:13:46Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
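The two-branch separation described here can be pictured as in the sketch below, where independent motion and appearance streams are fused before prediction; the recurrent encoders and fusion scheme are assumptions.
```python
# Schematic two-branch design: one stream for appearance features, one for
# motion features, fused before prediction. All shapes are assumptions.
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.appearance = nn.GRU(dim, dim, batch_first=True)
        self.motion = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, app_feats, motion_feats):  # both (B, T, dim)
        a, _ = self.appearance(app_feats)
        m, _ = self.motion(motion_feats)
        return self.fuse(torch.cat([a, m], dim=-1))

fused = TwoBranchFusion()(torch.randn(2, 32, 256), torch.randn(2, 32, 256))
print(fused.shape)  # torch.Size([2, 32, 256])
```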
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs [15.614710220461353]
We show that capturing long-term dependencies is the key to effective generation of dynamic scene graphs.
Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-12-18T03:02:11Z)
- StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
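The sketch below illustrates the core idea of training a temporal model on sequences of 512-dimensional latent codes rather than RGB frames; the next-code-prediction objective and the small LSTM are assumptions, not the paper's training setup.
```python
# Sketch: model sequences of 512-d latent codes (rather than RGB frames) with a
# small temporal network trained for next-code prediction. Objective and
# architecture are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_seqs = torch.randn(8, 30, 512)      # stand-in for per-frame StyleGAN latent codes
temporal = nn.LSTM(512, 512, batch_first=True)
optimizer = torch.optim.Adam(temporal.parameters(), lr=1e-4)

for _ in range(3):                          # tiny demonstration loop
    pred, _ = temporal(latent_seqs[:, :-1])            # predict the next latent code
    loss = F.mse_loss(pred, latent_seqs[:, 1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(float(loss))
```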
arXiv Detail & Related papers (2021-07-15T09:58:15Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
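A rough illustration of building a local temporal graph from differences between adjacent snippet features and applying one graph-convolution step is given below; the adjacency definition and dimensions are assumptions rather than ATAG's exact formulation.
```python
# Rough illustration: derive edge weights between adjacent snippets from their
# feature differences, then apply one graph-convolution step.
import torch
import torch.nn as nn

def local_adjacency(feats: torch.Tensor) -> torch.Tensor:
    # feats: (T, D) snippet features; weight edges between neighbours by similarity
    T = feats.size(0)
    adj = torch.eye(T)
    diff = (feats[1:] - feats[:-1]).norm(dim=-1)       # difference of adjacent features
    w = torch.exp(-diff)                               # smaller difference -> stronger edge
    idx = torch.arange(T - 1)
    adj[idx, idx + 1] = w
    adj[idx + 1, idx] = w
    return adj / adj.sum(dim=1, keepdim=True)          # row-normalise

snippets = torch.randn(100, 256)
gcn_layer = nn.Linear(256, 256)
out = torch.relu(gcn_layer(local_adjacency(snippets) @ snippets))
print(out.shape)  # torch.Size([100, 256])
```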
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.