STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
- URL: http://arxiv.org/abs/2309.06680v1
- Date: Wed, 13 Sep 2023 02:35:59 GMT
- Title: STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning
- Authors: Palaash Agrawal, Haidi Azaman, Cheston Tan
- Abstract summary: We propose a large-scale video dataset for understanding spatial relationships derived from prepositions of the English language.
The dataset contains 150K visual depictions (videos and images), consisting of 30 distinct spatial prepositional senses.
In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions.
- Score: 5.256237513030104
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding relations between objects is crucial for understanding the
semantics of a visual scene. It is also an essential step in order to bridge
visual and language models. However, current state-of-the-art computer vision
models still lack the ability to perform spatial reasoning well. Existing
datasets mostly cover a relatively small number of spatial relations, all of
which are static relations that do not intrinsically involve motion. In this
paper, we propose the Spatial and Temporal Understanding of Prepositions
Dataset (STUPD) -- a large-scale video dataset for understanding static and
dynamic spatial relationships derived from prepositions of the English
language. The dataset contains 150K visual depictions (videos and images),
consisting of 30 distinct spatial prepositional senses, in the form of object
interaction simulations generated synthetically using Unity3D. In addition to
spatial relations, we also propose 50K visual depictions across 10 temporal
relations, consisting of videos depicting event/time-point interactions. To our
knowledge, no dataset exists that represents temporal relations through visual
settings. In this dataset, we also provide 3D information about object
interactions such as frame-wise coordinates, and descriptions of the objects
used. The goal of this synthetic dataset is to help models perform better in
visual relationship detection in real-world settings. We demonstrate that various
models, when pretrained on the STUPD dataset, achieve higher performance on two
real-world datasets (ImageNet-VidVRD and Spatial Senses) than when pretrained on
other datasets.
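As a concrete illustration of how a dataset like this might be consumed, the sketch below loads video clips together with their relation labels and frame-wise 3D coordinates through a PyTorch-style Dataset. The directory layout, file names, and JSON fields (annotations.json, "relation", "coordinates") are assumptions made for illustration only, not the released STUPD format.

```python
# Hypothetical loader for an STUPD-style annotation layout. The directory
# structure, file names, and JSON schema below are illustrative assumptions,
# not the dataset's released format.
import json
from pathlib import Path

import torch
from torch.utils.data import Dataset
from torchvision.io import read_video


class PrepositionVideoDataset(Dataset):
    """Yields (frames, relation id, frame-wise 3D coordinates) triples."""

    def __init__(self, root: str, annotation_file: str = "annotations.json"):
        self.root = Path(root)
        with open(self.root / annotation_file) as f:
            # Assumed schema: [{"video": "clips/0001.mp4", "relation": "across",
            #                   "coordinates": [[x, y, z], ...]}, ...]
            self.samples = json.load(f)
        relations = sorted({s["relation"] for s in self.samples})
        self.relation_to_id = {r: i for i, r in enumerate(relations)}

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        sample = self.samples[idx]
        # Decode the clip as a (T, H, W, C) uint8 tensor, then reshape/scale.
        frames, _, _ = read_video(str(self.root / sample["video"]), pts_unit="sec")
        frames = frames.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W)
        label = self.relation_to_id[sample["relation"]]
        coords = torch.tensor(sample["coordinates"], dtype=torch.float32)
        return frames, label, coords
```

A loader of this kind could feed a standard video classifier for relation pretraining before fine-tuning on real-world data such as ImageNet-VidVRD, mirroring the transfer setup described in the abstract.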
Related papers
- Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
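For intuition, the sketch below scores whether a generated image respects a stated 2D spatial relation by comparing detected bounding-box centroids. It only illustrates the general idea behind such an evaluation; it is not the paper's exact VISOR definition, and the relation vocabulary and detection format are assumptions.

```python
# Illustrative check of a 2D spatial relation between two detected objects,
# based on bounding-box centroids. Not the paper's exact metric definition.
from collections.abc import Sequence

Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def centroid(box: Box) -> tuple[float, float]:
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def relation_holds(box_a: Box, box_b: Box, relation: str) -> bool:
    """True if object A stands in `relation` to object B (image coords, y grows downward)."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unsupported relation: {relation}")


def relation_accuracy(samples: Sequence[tuple[Box, Box, str]]) -> float:
    """Fraction of images whose detected objects satisfy the prompted relation."""
    correct = sum(relation_holds(a, b, rel) for a, b, rel in samples)
    return correct / len(samples) if samples else 0.0
```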
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
- PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking? [62.997667081978825]
We encode 3D detections as nodes in a graph, where spatial and temporal pairwise relations among objects are encoded via localized polar coordinates on graph edges.
This allows our graph neural network to learn to effectively encode temporal and spatial interactions.
We establish a new state-of-the-art on the nuScenes dataset and, more importantly, show that our method, PolarMOT, generalizes remarkably well across different locations.
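The sketch below illustrates the idea of a localized polar edge feature between two 3D detections: distance, bearing expressed in the source detection's local frame, and frame offset. PolarMOT's actual edge features and normalization may differ; this is only a minimal rendering of the idea described in the summary.

```python
# Minimal sketch: encode the relation between two 3D detections as localized
# polar coordinates on a graph edge (range, relative bearing, frame offset).
import math
from dataclasses import dataclass


@dataclass
class Detection:
    x: float      # position in a common world frame (metres)
    y: float
    yaw: float    # heading angle (radians)
    frame: int    # frame index


def polar_edge_feature(src: Detection, dst: Detection) -> tuple[float, float, float]:
    """Edge feature for src -> dst: (distance, relative bearing, time offset)."""
    dx, dy = dst.x - src.x, dst.y - src.y
    distance = math.hypot(dx, dy)
    # Bearing of dst as seen from src, expressed in src's local frame so the
    # feature is invariant to src's absolute position and orientation.
    bearing = math.atan2(dy, dx) - src.yaw
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # wrap to (-pi, pi]
    return distance, bearing, float(dst.frame - src.frame)
```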
arXiv Detail & Related papers (2022-08-03T10:06:56Z)
- Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial location, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
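As a rough illustration of such a graph, the sketch below builds spatial edges between human and object detections within each frame and temporal edges between successive appearances of the same tracked instance. The cited paper's graph parsing network is not reproduced here; the node schema and edge rules are assumptions for illustration.

```python
# Minimal sketch of a spatio-temporal interaction graph for video HOI:
# detections are nodes, spatial edges connect humans and objects within a
# frame, temporal edges link successive appearances of the same instance.
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    frame: int
    instance_id: int  # stable identity across frames (e.g., from a tracker)
    category: str     # "human" or an object class


def build_st_graph(nodes: list[Node]) -> tuple[list[tuple[Node, Node]], list[tuple[Node, Node]]]:
    """Return (spatial_edges, temporal_edges)."""
    spatial_edges, temporal_edges = [], []

    # Spatial edges: human-object pairs within the same frame.
    by_frame: dict[int, list[Node]] = {}
    for n in nodes:
        by_frame.setdefault(n.frame, []).append(n)
    for frame_nodes in by_frame.values():
        humans = [n for n in frame_nodes if n.category == "human"]
        objects = [n for n in frame_nodes if n.category != "human"]
        spatial_edges.extend((h, o) for h in humans for o in objects)

    # Temporal edges: consecutive appearances of the same tracked instance.
    by_instance: dict[int, list[Node]] = {}
    for n in sorted(nodes, key=lambda n: n.frame):
        by_instance.setdefault(n.instance_id, []).append(n)
    for track in by_instance.values():
        temporal_edges.extend(zip(track, track[1:]))

    return spatial_edges, temporal_edges
```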
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
- LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos [13.25502885135043]
Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video.
We present a hierarchical approach, LIGHTEN, to learn visual features that effectively capture spatio-temporal cues at multiple granularities in a video.
We achieve state-of-the-art results (88.9% and 92.6%) on the human-object interaction detection and anticipation tasks of CAD-120, and competitive results on image-based HOI detection in V-COCO.
arXiv Detail & Related papers (2020-12-17T05:44:07Z)
- Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D [71.11034329713058]
Existing datasets lack large-scale, high-quality 3D ground truth information.
Rel3D is the first large-scale, human-annotated dataset for grounding spatial relations in 3D.
We propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias.
arXiv Detail & Related papers (2020-12-03T01:51:56Z)
- RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces [77.07767833443256]
We present RELATE, a model that learns to generate physically plausible scenes and videos of multiple interacting objects.
In contrast to state-of-the-art methods in object-centric generative modeling, RELATE also extends naturally to dynamic scenes and generates videos of high visual fidelity.
arXiv Detail & Related papers (2020-07-02T17:27:27Z)