Cross-Modality Time-Variant Relation Learning for Generating Dynamic
Scene Graphs
- URL: http://arxiv.org/abs/2305.08522v1
- Date: Mon, 15 May 2023 10:30:38 GMT
- Title: Cross-Modality Time-Variant Relation Learning for Generating Dynamic
Scene Graphs
- Authors: Jingyi Wang, Jinfa Huang, Can Zhang, and Zhidong Deng
- Abstract summary: We propose a Time-variant Relation-aware TRansformer (TR$^2$) to model the temporal change of relations in dynamic scene graphs.
We show that TR$^2$ significantly outperforms previous state-of-the-art methods under two different settings.
- Score: 16.760066844287046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dynamic scene graphs generated from video clips could help enhance the
semantic visual understanding in a wide range of challenging tasks such as
environmental perception, autonomous navigation, and task planning of
self-driving vehicles and mobile robots. In the process of temporal and spatial
modeling during dynamic scene graph generation, it is particularly intractable
to learn time-variant relations in dynamic scene graphs among frames. In this
paper, we propose a Time-variant Relation-aware TRansformer (TR$^2$), which
aims to model the temporal change of relations in dynamic scene graphs.
Explicitly, we leverage the difference of text embeddings of prompted sentences
about relation labels as the supervision signal for relations. In this way,
cross-modality feature guidance is realized for the learning of time-variant
relations. Implicitly, we design a relation feature fusion module with a
transformer and an additional message token that describes the difference
between adjacent frames. Extensive experiments on the Action Genome dataset
prove that our TR$^2$ can effectively model the time-variant relations. TR$^2$
significantly outperforms previous state-of-the-art methods under two different
settings by 2.1% and 2.6% respectively.
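To make the explicit cross-modality guidance above more concrete, the sketch below (not the authors' released code; the prompt template, text encoder, projection head, and cosine loss are all assumptions) derives a supervision target for a time-variant relation as the difference between text embeddings of prompted sentences for the relation labels in adjacent frames, and aligns the change of the visual relation feature with that target.
```python
# Sketch of the explicit cross-modality guidance: the change of a relation
# across adjacent frames is supervised by the difference of text embeddings
# of prompted sentences about the relation labels. Prompt template, encoder,
# and loss here are assumptions, not the authors' released implementation.
import torch.nn.functional as F


def relation_prompt(subject: str, relation: str, obj: str) -> str:
    # Hypothetical prompt template turning a relation triplet into a sentence.
    return f"a photo of a {subject} {relation} a {obj}"


def relation_shift_target(encode_text, subj, obj, rel_t, rel_t1):
    # encode_text: any frozen pretrained text encoder mapping a sentence to a
    # 1-D embedding (e.g., a CLIP-style text tower); its choice is an assumption.
    e_t = encode_text(relation_prompt(subj, rel_t, obj))
    e_t1 = encode_text(relation_prompt(subj, rel_t1, obj))
    return e_t1 - e_t  # direction of the relation change in text space


def cross_modal_guidance_loss(vis_rel_t, vis_rel_t1, target_shift, proj):
    # Project the change of visual relation features into the text space and
    # align it with the text-embedding difference (cosine alignment assumed).
    pred_shift = proj(vis_rel_t1 - vis_rel_t)
    return 1.0 - F.cosine_similarity(pred_shift, target_shift, dim=-1).mean()
```
In practice a frozen CLIP-style text tower could supply encode_text, with proj a small learned layer mapping visual relation features into the text embedding space.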
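The implicit side, a relation feature fusion module with an extra message token describing the inter-frame difference, might look roughly like the following sketch; the dimensions, layer counts, and the way the message token is formed are placeholders rather than the paper's configuration.
```python
# Sketch of the implicit mechanism: a transformer-based relation feature
# fusion module with an extra "message" token that summarises the difference
# between adjacent frames. Sizes and layer counts are placeholders, not the
# configuration reported in the paper.
import torch
import torch.nn as nn


class RelationFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        # Builds the message token from the pooled inter-frame difference.
        self.msg_proj = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, rel_t: torch.Tensor, rel_t1: torch.Tensor) -> torch.Tensor:
        # rel_t, rel_t1: (batch, num_relations, dim) relation features of
        # frame t and frame t+1; returns fused features for frame t+1.
        msg = self.msg_proj((rel_t1 - rel_t).mean(dim=1, keepdim=True))
        tokens = torch.cat([msg, rel_t, rel_t1], dim=1)
        fused = self.encoder(tokens)
        # Keep only the updated frame-(t+1) relation tokens.
        return fused[:, 1 + rel_t.size(1):, :]
```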
Related papers
- Towards Scene Graph Anticipation [10.678727237318503]
We introduce the task of Scene Graph Anticipation (SGA).
We adapt state-of-the-art scene graph generation methods as baselines to anticipate future pair-wise relationships between objects.
In SceneSayer, we leverage object-centric representations of relationships to reason about the observed video frames and model the evolution of relationships between objects.
arXiv Detail & Related papers (2024-03-07T21:08:51Z)
- TimeGraphs: Graph-based Temporal Reasoning [64.18083371645956]
TimeGraphs is a novel approach that characterizes dynamic interactions as a hierarchical temporal graph.
Our approach models the interactions using a compact graph-based representation, enabling adaptive reasoning across diverse time scales.
We evaluate TimeGraphs on multiple datasets with complex, dynamic agent interactions, including a football simulator, the Resistance game, and the MOMA human activity dataset.
arXiv Detail & Related papers (2024-01-06T06:26:49Z)
- Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces the local interaction information and global human-action interaction information.
Long-temporal human actions supervise the model to generate multiple scene graphs that conform to global constraints and prevent it from failing to learn tail predicates.
arXiv Detail & Related papers (2023-08-10T01:24:25Z)
- DyTed: Disentangled Representation Learning for Discrete-time Dynamic Graph [59.583555454424]
We propose a novel disenTangled representation learning framework for discrete-time Dynamic graphs, namely DyTed.
We specially design a temporal-clips contrastive learning task together with structure contrastive learning to effectively identify the time-invariant and time-varying representations, respectively.
arXiv Detail & Related papers (2022-10-19T14:34:12Z)
- Time-aware Dynamic Graph Embedding for Asynchronous Structural Evolution [60.695162101159134]
Existing works merely view a dynamic graph as a sequence of changes.
We formulate dynamic graphs as temporal edge sequences associated with the joining time of vertices and the timespan of edges (ToEs).
A time-aware Transformer is proposed to embed vertices' dynamic connections and ToEs into the learned vertex representations.
arXiv Detail & Related papers (2022-07-01T15:32:56Z)
- Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs [15.614710220461353]
We show that capturing long-term dependencies is the key to effective generation of dynamic scene graphs.
Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-12-18T03:02:11Z)
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG)
arXiv Detail & Related papers (2021-07-26T16:30:30Z)
- TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion.
To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Understanding Dynamic Scenes using Graph Convolution Networks [22.022759283770377]
We present a novel framework to model on-road vehicle behaviors from a sequence of temporally ordered frames captured by a moving camera.
We show a seamless transfer of learning to multiple datasets without resorting to fine-tuning.
Such behavior prediction methods find immediate relevance in a variety of navigation tasks.
arXiv Detail & Related papers (2020-05-09T13:05:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.