(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
- URL: http://arxiv.org/abs/2202.09277v1
- Date: Fri, 18 Feb 2022 15:58:54 GMT
- Title: (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
- Authors: Anoop Cherian and Chiori Hori and Tim K. Marks and Jonathan Le Roux
- Abstract summary: Videos are essentially sequences of 2D "views" of events happening in a 3D space.
We propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside videos.
- Score: 54.436179346454516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-temporal scene-graph approaches to video-based reasoning tasks such as
video question-answering (QA) typically construct such graphs for every video
frame. Such approaches often ignore the fact that videos are essentially
sequences of 2D "views" of events happening in a 3D space, and that the
semantics of the 3D scene can thus be carried over from frame to frame.
Leveraging this insight, we propose a (2.5+1)D scene graph representation to
better capture the spatio-temporal information flows inside the videos.
Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming
every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D
transformation module, following which we register the video frames into a
shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it.
Such a (2.5+1)D graph is then segregated into a static sub-graph and a dynamic
sub-graph, corresponding to whether the objects within them usually move in the
world. The nodes in the dynamic graph are enriched with motion features
capturing their interactions with other graph nodes. Next, for the video QA
task, we present a novel transformer-based reasoning pipeline that embeds the
(2.5+1)D graph into a spatio-temporal hierarchical latent space, where the
sub-graphs and their interactions are captured at varied granularity. To
demonstrate the effectiveness of our approach, we present experiments on the
NExT-QA and AVSD-QA datasets. Our results show that the proposed (2.5+1)D
representation leads to faster training and inference, while our hierarchical
model outperforms the state of the art on the video QA task.
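To make the graph-construction step of the abstract concrete, below is a minimal, hypothetical Python sketch: 2D detections are lifted to pseudo-3D points using an inferred depth map, registered into a shared spatio-temporal space via per-frame camera poses, and split into static and dynamic sub-graphs, with dynamic nodes receiving a motion feature. All names here (`lift_to_2p5d`, `build_graph`, the `STATIC_CLASSES` heuristic, the displacement-based motion feature) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of (2.5+1)D scene graph construction, assuming a pinhole
# camera model and an off-the-shelf monocular depth estimator. All names
# and the static-class heuristic are illustrative, not the authors' code.
from dataclasses import dataclass
from typing import Optional
import numpy as np

# Classes assumed to be static in the world; the paper's actual
# static/dynamic segregation criterion may differ.
STATIC_CLASSES = {"table", "sofa", "wall", "shelf"}

@dataclass
class Node:
    label: str
    feature: np.ndarray                  # appearance feature from a 2D detector
    xyz: np.ndarray                      # pseudo-3D centroid in the shared space
    t: int                               # frame index (the "+1" temporal axis)
    motion: Optional[np.ndarray] = None  # filled in for dynamic nodes only

def lift_to_2p5d(box, depth_map, K_inv):
    """Back-project a 2D box center to a pseudo-3D point using an
    inferred depth map (the off-the-shelf 2D-to-3D module)."""
    u = (box[0] + box[2]) / 2.0
    v = (box[1] + box[3]) / 2.0
    z = float(depth_map[int(v), int(u)])        # inferred depth at box center
    return z * (K_inv @ np.array([u, v, 1.0]))  # pinhole back-projection

def build_graph(frames, K_inv, cam_poses):
    """frames: per-frame lists of (label, box, feature, depth_map) tuples;
    cam_poses: 4x4 camera-to-shared-space transforms (frame registration)."""
    static_nodes, dynamic_nodes = [], []
    for t, (detections, pose) in enumerate(zip(frames, cam_poses)):
        for label, box, feat, depth_map in detections:
            p_cam = lift_to_2p5d(box, depth_map, K_inv)
            p_world = (pose @ np.append(p_cam, 1.0))[:3]  # ground in shared space
            node = Node(label, feat, p_world, t)
            (static_nodes if label in STATIC_CLASSES else dynamic_nodes).append(node)
    # Enrich dynamic nodes with a crude motion feature: displacement of the
    # same-labelled node between consecutive frames (a simple stand-in for
    # the paper's interaction-aware motion features).
    last_seen = {}
    for node in sorted(dynamic_nodes, key=lambda n: n.t):
        prev = last_seen.get(node.label)
        node.motion = node.xyz - prev.xyz if prev is not None else np.zeros(3)
        last_seen[node.label] = node
    return static_nodes, dynamic_nodes
```

Consistent with the abstract, only the dynamic sub-graph carries motion features in this sketch; the static sub-graph keeps just appearance and registered position.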
Related papers
- 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation? [5.408549711581793]
We study the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models.
We employ a lifting model to convert generated 2D pose sequences into 3D and assess how gestures created directly in 3D compare against those initially generated in 2D and then converted to 3D.
arXiv Detail & Related papers (2024-09-16T15:06:12Z)
- LoopGaussian: Creating 3D Cinemagraph with Multi-view Images via Eulerian Motion Field [13.815932949774858]
A cinemagraph is a form of visual media that combines elements of still photography and subtle motion to create a captivating experience.
We propose LoopGaussian to elevate cinemagraphs from 2D image space to 3D space using 3D Gaussian modeling.
Experimental results validate the effectiveness of our approach, demonstrating high-quality and visually appealing scene generation.
arXiv Detail & Related papers (2024-04-13T11:07:53Z)
- Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling [9.440800948514449]
We propose a weakly-supervised 3D scene graph generation method via visual-linguistic assisted pseudo-labeling.
Our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images.
We design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes.
arXiv Detail & Related papers (2024-04-03T07:30:09Z)
- SGAligner: 3D Scene Alignment with Scene Graphs [84.01002998166145]
Building 3D scene graphs has emerged as a topic in scene representation for several embodied AI applications.
We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial.
We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios.
arXiv Detail & Related papers (2023-04-28T14:39:22Z)
- Tracking by 3D Model Estimation of Unknown Objects in Videos [122.56499878291916]
We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation.
Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object across all video frames.
The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose.
arXiv Detail & Related papers (2023-04-13T11:32:36Z)
- Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs [85.54212143154986]
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications.
Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges).
We propose the first work that directly generates shapes from a scene graph in an end-to-end manner.
arXiv Detail & Related papers (2021-08-19T17:59:07Z)
- SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences [76.28527350263012]
We propose a method to incrementally build up semantic scene graphs from a 3D environment given a sequence of RGB-D frames.
We aggregate PointNet features from primitive scene components by means of a graph neural network.
Our approach outperforms 3D scene graph prediction methods by a large margin, and its accuracy is on par with other 3D semantic and panoptic segmentation methods while running at 35 Hz.
arXiv Detail & Related papers (2021-03-27T13:00:36Z)
- Unsupervised object-centric video generation and decomposition in 3D [36.08064849807464]
We propose to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background.
Our model is trained from monocular videos without any supervision, yet it learns to generate coherent 3D scenes containing several moving objects.
arXiv Detail & Related papers (2020-07-07T18:01:29Z)
- Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions [94.17683799712397]
We focus on scene graphs, a data structure that organizes the entities of a scene in a graph.
We propose a learned method that regresses a scene graph from the point cloud of a scene.
We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.
arXiv Detail & Related papers (2020-04-08T12:25:25Z)