Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs
- URL: http://arxiv.org/abs/2112.09828v1
- Date: Sat, 18 Dec 2021 03:02:11 GMT
- Title: Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs
- Authors: Shengyu Feng, Subarna Tripathi, Hesham Mostafa, Marcel Nassar, Somdeb Majumdar
- Abstract summary: We show that capturing long-term dependencies is the key to effective generation of dynamic scene graphs.
Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods.
- Score: 15.614710220461353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structured video representation in the form of dynamic scene graphs is an
effective tool for several video understanding tasks. Compared to the task of
scene graph generation from images, dynamic scene graph generation is more
challenging due to the temporal dynamics of the scene and the inherent temporal
fluctuations of predictions. We show that capturing long-term dependencies is
the key to effective generation of dynamic scene graphs. We present the
detect-track-recognize paradigm by constructing consistent long-term object
tracklets from a video, followed by transformers to capture the dynamics of
objects and visual relations. Experimental results demonstrate that our Dynamic
Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art
methods by a significant margin on the benchmark dataset Action Genome. We also
perform ablation studies and validate the effectiveness of each component of
the proposed approach.
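As a rough illustration of the detect-track-recognize paradigm described in the abstract, the sketch below links per-frame detections into long-term object tracklets by greedy feature similarity and then runs a transformer encoder over each tracklet to capture its temporal dynamics. This is a minimal sketch under assumed names (`build_tracklets`, `TemporalEncoder`) and an assumed matching rule, not the DSG-DETR implementation.

```python
# A minimal, illustrative sketch of the detect-track-recognize idea: per-frame
# detections are linked into long-term object tracklets, and a transformer then
# models each tracklet's temporal dynamics. All names (build_tracklets,
# TemporalEncoder) and the greedy matching rule are assumptions for illustration,
# not the DSG-DETR implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_tracklets(per_frame_feats, sim_threshold=0.5):
    """Greedily link detections across frames by feature similarity.

    per_frame_feats: list of T tensors, each (N_t, D) with detection features.
    Returns a list of tracklets, each a list of (frame_idx, det_idx) pairs.
    """
    tracklets = []  # each entry: {"last_feat": tensor, "items": [(t, i), ...]}
    for t, feats in enumerate(per_frame_feats):
        assigned = set()
        for tr in tracklets:
            if feats.shape[0] == 0:
                break
            # cosine similarity between the tracklet's last feature and all detections
            sims = F.cosine_similarity(tr["last_feat"].unsqueeze(0), feats, dim=1)
            if assigned:  # do not reuse detections already matched in this frame
                sims[list(assigned)] = -1.0
            best = int(torch.argmax(sims))
            if sims[best] > sim_threshold:
                tr["items"].append((t, best))
                tr["last_feat"] = feats[best]
                assigned.add(best)
        # unmatched detections start new tracklets
        for i in range(feats.shape[0]):
            if i not in assigned:
                tracklets.append({"last_feat": feats[i], "items": [(t, i)]})
    return [tr["items"] for tr in tracklets]


class TemporalEncoder(nn.Module):
    """Transformer encoder over one tracklet's feature sequence (long-term context)."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tracklet_feats):       # (1, T, D)
        return self.encoder(tracklet_feats)  # temporally contextualized features


if __name__ == "__main__":
    torch.manual_seed(0)
    frames = [torch.randn(3, 256) for _ in range(5)]  # 5 frames, 3 detections each
    # low threshold only so random features still form multi-frame tracklets
    tracklets = build_tracklets(frames, sim_threshold=0.0)
    seq = torch.stack([frames[t][i] for t, i in tracklets[0]]).unsqueeze(0)
    print(len(tracklets), TemporalEncoder()(seq).shape)
```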
Related papers
- Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation [10.678727237318503]
Impar is a novel training framework that leverages curriculum learning and loss masking to mitigate bias in generation and anticipation modelling.
We introduce two new tasks, Robust Spatio-Temporal Scene Graph Generation and Robust Scene Graph Anticipation, designed to evaluate the robustness of STSG models against distribution shifts.
arXiv Detail & Related papers (2024-11-20T06:15:28Z) - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z) - Retrieval Augmented Generation for Dynamic Graph Modeling [15.09162213134372]
Dynamic graph modeling is crucial for analyzing evolving patterns in various applications.
Existing approaches often integrate graph neural networks with temporal modules or redefine dynamic graph modeling as a generative sequence task.
We introduce the Retrieval-Augmented Generation for Dynamic Graph Modeling (RAG4DyG) framework, which leverages guidance from contextually and temporally analogous examples.
arXiv Detail & Related papers (2024-08-26T09:23:35Z) - TimeGraphs: Graph-based Temporal Reasoning [64.18083371645956]
TimeGraphs is a novel approach that characterizes dynamic interactions as a hierarchical temporal graph.
Our approach models the interactions using a compact graph-based representation, enabling adaptive reasoning across diverse time scales.
We evaluate TimeGraphs on multiple datasets with complex, dynamic agent interactions, including a football simulator, the Resistance game, and the MOMA human activity dataset.
arXiv Detail & Related papers (2024-01-06T06:26:49Z) - Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces local interaction information and global human-action interaction information.
Long-term human actions supervise the model to generate multiple scene graphs that conform to the global constraints, so that the model does not fail to learn the tail predicates.
arXiv Detail & Related papers (2023-08-10T01:24:25Z) - EasyDGL: Encode, Train and Interpret for Continuous-time Dynamic Graph Learning [92.71579608528907]
This paper aims to design an easy-to-use pipeline (termed EasyDGL) composed of three key modules with both strong fitting ability and interpretability.
EasyDGL can effectively quantify the predictive power of the frequency content that a model learns from the evolving graph data.
arXiv Detail & Related papers (2023-03-22T06:35:08Z) - Time-aware Dynamic Graph Embedding for Asynchronous Structural Evolution [60.695162101159134]
Existing works merely view a dynamic graph as a sequence of changes.
We formulate dynamic graphs as temporal edge sequences associated with the joining time of vertices and the timespan of edges (ToEs).
A time-aware Transformer is proposed to embed vertices' dynamic connections and ToEs into the learned vertex representations.
arXiv Detail & Related papers (2022-07-01T15:32:56Z) - Efficient Dynamic Graph Representation Learning at Scale [66.62859857734104]
We propose Efficient Dynamic Graph lEarning (EDGE), which selectively expresses certain temporal dependencies via the training loss to improve parallelism in computation.
We show that EDGE can scale to dynamic graphs with millions of nodes and hundreds of millions of temporal events and achieve new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2021-12-14T22:24:53Z) - Event Detection on Dynamic Graphs [4.128347119808724]
Event detection is a critical task for timely decision-making in graph analytics applications.
We propose DyGED, a simple yet novel deep learning model for event detection on dynamic graphs.
arXiv Detail & Related papers (2021-10-23T05:52:03Z) - Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within the frame, and (2) a temporal decoder which takes the output of the spatial encoder as input; a minimal code sketch of this split appears after this list.
Our method is validated on the benchmark dataset Action Genome (AG).
arXiv Detail & Related papers (2021-07-26T16:30:30Z)
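The following is a minimal sketch of the spatial-encoder / temporal-decoder split summarized in the Spatial-Temporal Transformer entry above: a transformer encoder provides per-frame spatial context, and a transformer decoder attends over a sliding window of encoded frames to produce temporally consistent relation features. Module names and the window handling are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of a spatial-encoder / temporal-decoder architecture for
# per-frame relation features. Names and window handling are assumptions for
# illustration, not the Spatial-Temporal Transformer paper's implementation.
import torch
import torch.nn as nn


class SpatialTemporalSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # spatial encoder: self-attention over the relation features within one frame
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        # temporal decoder: attends over the spatially encoded features of a sliding
        # window of frames to produce temporally consistent relation representations
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal_decoder = nn.TransformerDecoder(dec_layer, num_layers=1)

    def forward(self, frame_feats):
        # frame_feats: (T, R, D) relation features for T frames, R relations per frame
        spatial = self.spatial_encoder(frame_feats)          # per-frame spatial context
        window = spatial.reshape(1, -1, spatial.shape[-1])   # flatten the window of frames
        queries = spatial[-1:]                                # current-frame relations as queries
        return self.temporal_decoder(queries, window)         # (1, R, D)


if __name__ == "__main__":
    model = SpatialTemporalSketch()
    out = model(torch.randn(4, 6, 256))  # 4-frame window, 6 relation features per frame
    print(out.shape)                     # torch.Size([1, 6, 256])
```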
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.