Spatiotemporal Event Graphs for Dynamic Scene Understanding
- URL: http://arxiv.org/abs/2312.07621v1
- Date: Mon, 11 Dec 2023 22:30:13 GMT
- Title: Spatiotemporal Event Graphs for Dynamic Scene Understanding
- Authors: Salman Khan
- Abstract summary: We present a series of frameworks for dynamic scene understanding starting from an autonomous driving perspective to complex video activity detection.
We propose a hybrid graph neural network that combines attention applied to a graph encoding the local (short-term) dynamic scene with a temporal graph modelling the overall long-term activity.
- Score: 14.735329256577101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dynamic scene understanding is the ability of a computer system to interpret
and make sense of the visual information present in a video of a real-world
scene. In this thesis, we present a series of frameworks for dynamic scene
understanding, ranging from road event detection in an autonomous driving
setting to complex video activity detection, followed by continual learning
approaches for the life-long learning of the models. First, we introduce the
ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge
the first of its kind. Due to the lack of datasets equipped with formally
specified logical requirements, we also introduce the ROad event Awareness
Dataset with logical Requirements (ROAD-R), the first publicly available
dataset for autonomous driving with requirements expressed as logical
constraints, as a tool for driving neurosymbolic research in the area. Next, we
extend event detection to holistic scene understanding by proposing two complex
activity detection methods. In the first method, we present a deformable,
spatiotemporal scene graph approach, consisting of three main building blocks:
action tube detection, a 3D deformable RoI pooling layer designed for learning
the flexible, deformable geometry of the constituent action tubes, and a scene
graph constructed by considering all parts as nodes and connecting them based
on different semantics. In a second approach evolving from the first, we
propose a hybrid graph neural network that combines attention applied to a
graph encoding of the local (short-term) dynamic scene with a temporal graph
modelling the overall long-duration activity. Finally, the thesis presents a
new continual semi-supervised learning (CSSL) paradigm.
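
To make the first method concrete, below is a minimal, hypothetical PyTorch sketch of deformable tube pooling. It simplifies the thesis's 3D deformable RoI pooling layer: the action tube is reduced to a single spatiotemporal box rather than per-frame boxes, a learned offset field deforms a regular sampling grid in a second pass, and all module names, sizes, and the two-pass scheme are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableTubePool(nn.Module):
    """Hypothetical sketch: pool a fixed (out_t, out_s, out_s) descriptor
    from a video feature volume inside a spatiotemporal box, then refine
    the sampling grid with learned offsets (two-pass deformable pooling)."""

    def __init__(self, channels: int, out_t: int = 4, out_s: int = 7):
        super().__init__()
        self.out_t, self.out_s = out_t, out_s
        # One (x, y, t) offset per output cell, predicted from the
        # regularly pooled features.
        self.offset_head = nn.Conv3d(channels, 3, kernel_size=3, padding=1)

    def _grid(self, box: torch.Tensor) -> torch.Tensor:
        # box = (t1, x1, y1, t2, x2, y2), all normalized to [-1, 1].
        t1, x1, y1, t2, x2, y2 = box.tolist()
        ts = torch.linspace(t1, t2, self.out_t)
        ys = torch.linspace(y1, y2, self.out_s)
        xs = torch.linspace(x1, x2, self.out_s)
        t, y, x = torch.meshgrid(ts, ys, xs, indexing="ij")
        # grid_sample on 5-D input expects grid shape (N, D, H, W, 3),
        # with the last dimension ordered (x, y, t).
        return torch.stack([x, y, t], dim=-1).unsqueeze(0)

    def forward(self, feats: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
        # feats: (1, C, T, H, W) video feature volume.
        grid = self._grid(box)                                 # (1, T', S, S, 3)
        base = F.grid_sample(feats, grid, align_corners=True)  # (1, C, T', S, S)
        offsets = torch.tanh(self.offset_head(base)) * 0.1     # keep deformations small
        deformed_grid = grid + offsets.permute(0, 2, 3, 4, 1)
        return F.grid_sample(feats, deformed_grid, align_corners=True)

pool = DeformableTubePool(channels=64)
feats = torch.randn(1, 64, 16, 32, 32)
box = torch.tensor([-0.5, -0.2, -0.2, 0.5, 0.6, 0.6])
desc = pool(feats, box)  # -> (1, 64, 4, 7, 7) fixed-size tube descriptor
```

The pooled descriptors can then serve as scene-graph nodes, with edges added according to the different semantics mentioned in the abstract.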
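Similarly, the hybrid graph neural network of the second approach can be read as per-snippet attention over local scene-graph nodes followed by a long-term temporal model over the resulting latents. In the hypothetical sketch below, a GRU stands in for the temporal graph stage; the learned scene token, layer sizes, and classifier head are assumptions, not the architecture reported in the thesis.

```python
import torch
import torch.nn as nn

class HybridGraphNet(nn.Module):
    """Hypothetical sketch of the hybrid design: attention aggregates each
    local (short-term) scene graph into one latent; a temporal model over
    those latents captures the long-duration activity."""

    def __init__(self, node_dim: int = 256, n_heads: int = 4, n_classes: int = 10):
        super().__init__()
        # Attention over the nodes of each local scene graph.
        self.local_attn = nn.MultiheadAttention(node_dim, n_heads, batch_first=True)
        self.scene_token = nn.Parameter(torch.randn(1, 1, node_dim) * 0.02)
        # A GRU stands in for the temporal graph over snippet latents.
        self.temporal = nn.GRU(node_dim, node_dim, batch_first=True)
        self.classifier = nn.Linear(node_dim, n_classes)

    def forward(self, snippets: list) -> torch.Tensor:
        # snippets: list of (n_nodes_i, node_dim) tensors, one local scene
        # graph (agent / action-tube node features) per short snippet.
        latents = []
        for nodes in snippets:
            kv = nodes.unsqueeze(0)                   # (1, n_i, D)
            scene, _ = self.local_attn(self.scene_token, kv, kv)
            latents.append(scene)                     # (1, 1, D) scene latent
        seq = torch.cat(latents, dim=1)               # (1, n_snippets, D)
        _, h = self.temporal(seq)                     # long-term temporal model
        return self.classifier(h[-1])                 # (1, n_classes) logits

net = HybridGraphNet()
snippets = [torch.randn(n, 256) for n in (5, 8, 3)]  # 3 snippets, varying node counts
logits = net(snippets)
```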
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.
Object-centric voxelization infers per-object occupancy probabilities at individual spatial locations.
Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly demonstrates our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- PreGSU-A Generalized Traffic Scene Understanding Model for Autonomous Driving based on Pre-trained Graph Attention Network [23.38434020807342]
Scene understanding, defined as the learning, extraction, and representation of interactions among traffic elements, is one of the critical challenges toward high-level autonomous driving (AD).
Current scene understanding methods mainly focus on one concrete single task, such as trajectory prediction and risk level evaluation.
We propose PreGSU, a generalized pre-trained scene understanding model based on graph attention network to learn the universal interaction and reasoning of traffic scenes to support various downstream tasks.
arXiv Detail & Related papers (2024-04-16T03:34:35Z)
- A Hybrid Graph Network for Complex Activity Detection in Video [40.843533889724924]
Complex Activity Detection (CompAD) extends analysis to long-term activities.
We propose a hybrid graph neural network which combines attention applied to a graph encoding the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity.
arXiv Detail & Related papers (2023-10-26T15:49:35Z)
- Spatio-Temporal Relation Learning for Video Anomaly Detection [35.59510027883497]
Anomaly identification is highly dependent on the relationship between the object and the scene.
In this paper, we propose a Spatial-Temporal Relation Learning framework to tackle the video anomaly detection task.
Experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2022-09-27T02:19:31Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video and directly matching them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Learning Multi-Object Dynamics with Compositional Neural Radiance Fields [63.424469458529906]
We present a method to learn compositional predictive models from image observations based on implicit object encoders, Neural Radiance Fields (NeRFs), and graph neural networks.
NeRFs have become a popular choice for representing scenes due to their strong 3D prior.
For planning, we utilize RRTs in the learned latent space, where we can exploit our model and the implicit object encoder to make sampling the latent space informative and more efficient.
arXiv Detail & Related papers (2022-02-24T01:31:29Z)
- Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Understanding Dynamic Scenes using Graph Convolution Networks [22.022759283770377]
We present a novel framework to model on-road vehicle behaviors from a sequence of temporally ordered frames captured by a moving camera.
We show a seamless transfer of learning to multiple datasets without resorting to fine-tuning.
Such behavior prediction methods find immediate relevance in a variety of navigation tasks.
arXiv Detail & Related papers (2020-05-09T13:05:06Z)
- 3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans [27.747241700017728]
We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs.
3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction.
arXiv Detail & Related papers (2020-02-15T00:46:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.