Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D
- URL: http://arxiv.org/abs/2012.01634v1
- Date: Thu, 3 Dec 2020 01:51:56 GMT
- Title: Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D
- Authors: Ankit Goyal, Kaiyu Yang, Dawei Yang, Jia Deng
- Abstract summary: Existing datasets lack large-scale, high-quality 3D ground truth information.
Rel3D is the first large-scale, human-annotated dataset for grounding spatial relations in 3D.
We propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias.
- Score: 71.11034329713058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding spatial relations (e.g., "laptop on table") in visual input is
important for both humans and robots. Existing datasets are insufficient as
they lack large-scale, high-quality 3D ground truth information, which is
critical for learning spatial relations. In this paper, we fill this gap by
constructing Rel3D: the first large-scale, human-annotated dataset for
grounding spatial relations in 3D. Rel3D enables quantifying the effectiveness
of 3D information in predicting spatial relations on large-scale human data.
Moreover, we propose minimally contrastive data collection -- a novel
crowdsourcing method for reducing dataset bias. The 3D scenes in our dataset
come in minimally contrastive pairs: two scenes in a pair are almost identical,
but a spatial relation holds in one and fails in the other. We empirically
validate that minimally contrastive examples can diagnose issues with current
relation detection models as well as lead to sample-efficient training. Code
and data are available at https://github.com/princeton-vl/Rel3D.
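To make the pair-based evaluation concrete, here is a minimal Python sketch of scoring a relation detector on minimally contrastive pairs; the ContrastivePair fields and the predict stub are illustrative assumptions, not Rel3D's actual data schema or API.

    from dataclasses import dataclass

    @dataclass
    class ContrastivePair:
        scene_pos: dict   # scene in which the relation holds (e.g., "laptop on table")
        scene_neg: dict   # nearly identical scene in which the same relation fails
        relation: str     # the spatial relation being tested

    def predict(scene: dict, relation: str) -> bool:
        # Stand-in for a learned relation detector.
        return scene.get(relation, False)

    def pair_accuracy(model, pairs):
        # A pair counts only if the model is right on BOTH scenes, so a
        # detector that exploits dataset bias (e.g., always answering True)
        # scores zero instead of 50%.
        correct = sum(
            model(p.scene_pos, p.relation) and not model(p.scene_neg, p.relation)
            for p in pairs
        )
        return correct / len(pairs)

    pairs = [ContrastivePair({"on": True}, {"on": False}, "on")]
    print(pair_accuracy(predict, pairs))  # 1.0 on this toy pair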
Related papers
- StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset [56.71580976007712]
We propose to use the Human-Object Offset between anchors densely sampled from the surfaces of the human mesh and the object mesh to represent the human-object spatial relation.
Based on this representation, we propose Stacked Normalizing Flow (StackFLOW) to infer the posterior distribution of human-object spatial relations from the image.
During the optimization stage, we finetune the human body pose and object 6D pose by maximizing the likelihood of samples.
arXiv Detail & Related papers (2024-07-30T04:57:21Z)
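A rough numpy sketch of the offset representation described above; random points stand in for anchors sampled on real human and object meshes, and this is not StackFLOW's actual implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    human_anchors = rng.normal(size=(64, 3))   # stand-in for anchors on the human mesh
    object_anchors = rng.normal(size=(32, 3))  # stand-in for anchors on the object mesh

    # Offset from every human anchor to every object anchor: shape (64, 32, 3).
    offsets = object_anchors[None, :, :] - human_anchors[:, None, :]

    # Flattened, the offsets give a fixed-length descriptor of the
    # human-object spatial layout on which a posterior can be modeled.
    descriptor = offsets.reshape(-1)
    print(descriptor.shape)  # (6144,)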
- STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning [4.676784872259775]
We propose a large-scale video dataset for understanding spatial relationships derived from prepositions of the English language.
The dataset contains 150K visual depictions (videos and images), consisting of 30 distinct spatial prepositional senses.
In addition to spatial relations, the dataset also provides 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions.
arXiv Detail & Related papers (2023-09-13T02:35:59Z)
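As a data-structure illustration only, one sample of such a dataset might be annotated along these lines; the field names are hypothetical and STUPD's actual schema may differ.

    sample = {
        "media": "clip_00042.mp4",   # video (or image) depicting the relation
        "subject": "ball",
        "object": "table",
        "preposition": "over",
        "sense": "static_above",     # one of the 30 distinct spatial senses
        "is_temporal": False,        # True for the 10 event/time-point relations
    }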
- A Unified BEV Model for Joint Learning of 3D Local Features and Overlap Estimation [12.499361832561634]
We present a unified bird's-eye view (BEV) model for the joint learning of 3D local features and overlap estimation.
Our method significantly outperforms existing methods on overlap prediction, especially in scenes with small overlaps.
arXiv Detail & Related papers (2023-02-28T12:01:16Z)
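A simplified sketch of the BEV representation such a model operates on: points binned into a 2D ground-plane grid. The grid size and spatial extent are illustrative choices, not the paper's configuration.

    import numpy as np

    def to_bev_occupancy(points, grid=64, extent=20.0):
        # points: (N, 3) array; returns a (grid, grid) occupancy map.
        ij = ((points[:, :2] + extent) / (2 * extent) * grid).astype(int)
        ij = ij[((ij >= 0) & (ij < grid)).all(axis=1)]
        bev = np.zeros((grid, grid), dtype=np.float32)
        bev[ij[:, 0], ij[:, 1]] = 1.0
        return bev

    cloud = np.random.default_rng(1).uniform(-20.0, 20.0, size=(1000, 3))
    print(to_bev_occupancy(cloud).sum())  # number of occupied BEV cells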
- Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats [80.12253291709673]
We propose a novel affine-combining autoencoder (ACAE) method that performs dimensionality reduction on the landmark set.
Our approach scales to an extreme multi-dataset regime, where we use 28 3D human pose datasets to supervise one model.
arXiv Detail & Related papers (2022-12-29T22:22:49Z)
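A numpy sketch of the affine-combining idea: latent keypoints are affine combinations (weights summing to 1) of the input landmarks, which makes the reduction commute with rotation and translation. The weights here are random stand-ins for learned parameters, and the real ACAE also permits negative weights.

    import numpy as np

    rng = np.random.default_rng(0)
    J, L = 28, 8                        # input landmarks, latent landmarks (illustrative sizes)
    W = rng.random((L, J))
    W /= W.sum(axis=1, keepdims=True)   # each row sums to 1 -> an affine combination

    pose = rng.normal(size=(J, 3))      # one skeleton with J landmarks in 3D
    latent = W @ pose                   # reduced set of L latent landmarks

    # Because each row of W sums to 1, encoding commutes with translation:
    # W @ (pose + t) == (W @ pose) + t, and likewise with rotation.
    t = np.array([1.0, 2.0, 3.0])
    assert np.allclose(W @ (pose + t), latent + t)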
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
- Self-supervised Human Mesh Recovery with Cross-Representation Alignment [20.69546341109787]
Self-supervised human mesh recovery methods have poor generalizability due to limited availability and diversity of 3D-annotated benchmark datasets.
We propose cross-representation alignment, which utilizes the complementary information from the robust but sparse representation (2D keypoints) and the richer but more sensitive dense representation.
This adaptive cross-representation alignment explicitly learns from the deviations between the two and captures their complementary strengths: robustness from the sparse representation and richness from the dense one.
arXiv Detail & Related papers (2022-09-10T04:47:20Z)
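A toy sketch of the alignment signal: dense mesh vertices are regressed to joints, projected to the image, and compared with the sparse 2D keypoints. The regressor matrix and the orthographic projection are simplifying assumptions, not the paper's exact formulation.

    import numpy as np

    rng = np.random.default_rng(0)
    vertices = rng.normal(size=(6890, 3))      # predicted dense mesh (SMPL-sized vertex count)
    J_reg = rng.random((17, 6890))
    J_reg /= J_reg.sum(axis=1, keepdims=True)  # stand-in joint regressor

    joints_2d = (J_reg @ vertices)[:, :2]      # orthographic projection of regressed joints
    keypoints_2d = rng.normal(size=(17, 2))    # detected sparse 2D keypoints

    # Mean per-joint deviation between the two representations; this is the
    # kind of alignment error the method learns from.
    print(np.mean(np.linalg.norm(joints_2d - keypoints_2d, axis=1)))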
- PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking? [62.997667081978825]
We encode 3D detections as nodes in a graph, where spatial and temporal pairwise relations among objects are encoded via localized polar coordinates on graph edges.
This allows our graph neural network to learn to effectively encode temporal and spatial interactions.
We establish a new state of the art on the nuScenes dataset and, more importantly, show that our method, PolarMOT, generalizes remarkably well across different locations.
arXiv Detail & Related papers (2022-08-03T10:06:56Z)
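A minimal sketch of a localized polar edge feature as described above: the displacement between two detections expressed as range and heading-relative bearing rather than global coordinates. The paper's full edge features may include additional terms.

    import numpy as np

    def polar_edge(src_xy, src_yaw, dst_xy):
        # Range and bearing of dst as seen from src, relative to src's heading.
        d = np.asarray(dst_xy, dtype=float) - np.asarray(src_xy, dtype=float)
        rng = np.hypot(d[0], d[1])
        bearing = np.arctan2(d[1], d[0]) - src_yaw
        bearing = np.arctan2(np.sin(bearing), np.cos(bearing))  # wrap to [-pi, pi]
        return rng, bearing

    print(polar_edge([0.0, 0.0], np.pi / 2, [3.0, 4.0]))  # (5.0, ~-0.64 rad)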
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of synthetic datasets, which consist of CAD object models, to boost learning on real datasets.
However, recent work on 3D pre-training fails when transferring features learned on synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
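A toy sketch of the scene-generation idea: synthetic object point clouds are dropped at random positions in a room, and their bounding boxes become pseudo-labels for detection pre-training. The placement logic is a stand-in for the paper's actual pipeline.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_room(cad_objects, room_size=8.0):
        # cad_objects: list of (N_i, 3) point clouds; returns one pseudo-scene.
        scene, boxes = [], []
        for pts in cad_objects:
            cx, cy = rng.uniform(1.0, room_size - 1.0, size=2)
            placed = pts + np.array([cx, cy, 0.0])  # drop the object on the floor plane
            scene.append(placed)
            boxes.append((placed.min(axis=0), placed.max(axis=0)))  # axis-aligned pseudo box
        return np.concatenate(scene), boxes

    objs = [rng.normal(scale=0.3, size=(100, 3)) for _ in range(5)]
    points, labels = random_room(objs)
    print(points.shape, len(labels))  # (500, 3) 5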
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.