Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
- URL: http://arxiv.org/abs/2211.09646v1
- Date: Thu, 17 Nov 2022 16:42:39 GMT
- Title: Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
- Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid,
Ivan Laptev
- Abstract summary: Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
- Score: 87.03299519917019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Localizing objects in 3D scenes based on natural language requires
understanding and reasoning about spatial relations. In particular, it is often
crucial to distinguish similar objects referred to by the text, such as "the
leftmost chair" and "a chair next to the window". In this work we propose a
language-conditioned transformer model for grounding 3D objects and their
spatial relations. To this end, we design a spatial self-attention layer that
accounts for relative distances and orientations between objects in input 3D
point clouds. Training such a layer with visual and language inputs enables the
model to disambiguate spatial relations and to localize objects referred to by
the text. To
facilitate the cross-modal learning of relations, we further propose a
teacher-student approach where the teacher model is first trained using
ground-truth object labels, and then helps to train a student model using point
cloud inputs. We perform ablation studies showing advantages of our approach.
We also demonstrate our model to significantly outperform the state of the art
on the challenging Nr3D, Sr3D and ScanRefer 3D object grounding datasets.
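The spatial self-attention layer described in the abstract lends itself to a short illustration. Below is a minimal PyTorch sketch, not the authors' implementation: it assumes pairwise spatial features between object centers (distance, horizontal bearing, height offset) are mapped to per-head attention biases, and that a sentence embedding gates how strongly each head uses those biases. The class name `LanguageConditionedSpatialAttention` and all hyperparameters are hypothetical.

```python
# Hedged sketch of a language-conditioned spatial self-attention layer:
# attention logits are biased by pairwise spatial features between object
# centers, with per-head gates predicted from a sentence embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageConditionedSpatialAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, spatial_dim: int = 5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Maps pairwise spatial features to one attention bias per head.
        self.spatial_mlp = nn.Sequential(
            nn.Linear(spatial_dim, dim), nn.ReLU(), nn.Linear(dim, num_heads)
        )
        # Predicts per-head gates from the sentence embedding, so the language
        # decides how strongly each head relies on spatial relations.
        self.lang_gate = nn.Linear(dim, num_heads)

    @staticmethod
    def pairwise_spatial(centers: torch.Tensor) -> torch.Tensor:
        # centers: (B, N, 3) object center coordinates.
        rel = centers[:, :, None, :] - centers[:, None, :, :]      # (B, N, N, 3)
        dist = rel.norm(dim=-1, keepdim=True)                      # (B, N, N, 1)
        theta = torch.atan2(rel[..., 1:2], rel[..., 0:1])          # horizontal bearing
        return torch.cat([dist, torch.sin(theta), torch.cos(theta),
                          rel[..., 2:3], rel[..., 2:3].abs()], dim=-1)  # (B, N, N, 5)

    def forward(self, obj_feats, centers, sent_feat):
        # obj_feats: (B, N, D), centers: (B, N, 3), sent_feat: (B, D)
        B, N, D = obj_feats.shape
        q, k, v = self.qkv(obj_feats).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5     # (B, H, N, N)
        bias = self.spatial_mlp(self.pairwise_spatial(centers))     # (B, N, N, H)
        gate = torch.sigmoid(self.lang_gate(sent_feat))             # (B, H)
        attn = attn + gate[:, :, None, None] * bias.permute(0, 3, 1, 2)

        out = (F.softmax(attn, dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```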
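The teacher-student scheme can likewise be sketched as a single training step. In this hedged illustration, the teacher is assumed to be a grounding model already trained on ground-truth object label embeddings, while the student consumes point-cloud features and is trained with a grounding loss plus distillation terms on the teacher's object scores and features. The function and key names (`distillation_step`, `class_embeds`, etc.) are illustrative, not the paper's exact formulation.

```python
# Hedged sketch of teacher-student training: a frozen teacher that sees
# ground-truth object label embeddings guides a student that only sees
# point-cloud features.
import torch
import torch.nn.functional as F


def distillation_step(teacher, student, batch, optimizer, alpha=1.0):
    # Assumed batch contents:
    #   'class_embeds' (B, N, D)  embeddings of ground-truth object labels
    #   'point_feats'  (B, N, D)  per-object point-cloud features
    #   'centers'      (B, N, 3)  object center coordinates
    #   'sent_feat'    (B, D)     sentence embedding
    #   'target'       (B,)       index of the referred object
    # Both models are assumed to return (object scores, object features).
    with torch.no_grad():
        t_logits, t_feats = teacher(batch['class_embeds'], batch['centers'],
                                    batch['sent_feat'])

    s_logits, s_feats = student(batch['point_feats'], batch['centers'],
                                batch['sent_feat'])

    # Grounding loss on the student plus soft distillation terms that pull the
    # student's object scores and features toward the teacher's.
    loss = F.cross_entropy(s_logits, batch['target'])
    loss = loss + alpha * F.kl_div(F.log_softmax(s_logits, dim=-1),
                                   F.softmax(t_logits, dim=-1),
                                   reduction='batchmean')
    loss = loss + alpha * F.mse_loss(s_feats, t_feats)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```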
Related papers
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Dense Object Grounding in 3D Scenes [28.05720194887322]
Localizing objects in 3D scenes according to the semantics of a given natural language is a fundamental yet important task in the field of multimedia understanding.
We introduce 3D Dense Object Grounding (3D DOG) to jointly localize multiple objects described by a paragraph rather than a single sentence.
Our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.
arXiv Detail & Related papers (2023-09-05T13:27:19Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate the "action possibilities" regions of objects in 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z)
- Looking Outside the Box to Ground Language in 3D Scenes [27.126171549887232]
We propose a model for grounding language in 3D scenes with three main innovations.
Iterative attention across the language stream, the point cloud feature stream and 3D box proposals.
Joint supervision from 3D object annotations and language grounding annotations.
When applied to language grounding on 2D images with minor changes, it performs on par with the state of the art while converging in half the GPU time.
arXiv Detail & Related papers (2021-12-16T13:50:23Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D [71.11034329713058]
Existing datasets lack large-scale, high-quality 3D ground truth information.
Rel3D is the first large-scale, human-annotated dataset for grounding spatial relations in 3D.
We propose minimally contrastive data collection -- a novel crowdsourcing method for reducing dataset bias.
arXiv Detail & Related papers (2020-12-03T01:51:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.