Dense Object Grounding in 3D Scenes
- URL: http://arxiv.org/abs/2309.02224v1
- Date: Tue, 5 Sep 2023 13:27:19 GMT
- Title: Dense Object Grounding in 3D Scenes
- Authors: Wencan Huang, Daizong Liu, Wei Hu
- Abstract summary: Localizing objects in 3D scenes according to the semantics of a given natural language description is a fundamental and important task in the field of multimedia understanding.
We introduce 3D Dense Object Grounding (3D DOG) to jointly localize multiple objects described in a more complicated paragraph rather than a single sentence.
Our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object grounding methods and their dense-object variants by significant margins.
- Score: 28.05720194887322
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Localizing objects in 3D scenes according to the semantics of a given natural
language description is a fundamental and important task in the field of multimedia
understanding, which benefits various real-world applications such as robotics
and autonomous driving. However, the majority of existing 3D object grounding
methods are restricted to a single-sentence input describing an individual
object, which cannot comprehend and reason more contextualized descriptions of
multiple objects in more practical 3D cases. To this end, we introduce a new
challenging task, called 3D Dense Object Grounding (3D DOG), to jointly
localize multiple objects described in a more complicated paragraph rather than
a single sentence. Instead of naively localizing each sentence-guided object
independently, we found that dense objects described in the same paragraph are
often semantically related and spatially located in a focused region of the 3D
scene. To explore such semantic and spatial relationships of densely referred
objects for more accurate localization, we propose a novel Stacked Transformer
based framework for 3D DOG, named 3DOGSFormer. Specifically, we first devise a
contextual query-driven local transformer decoder to generate initial grounding
proposals for each target object. Then, we employ a proposal-guided global
transformer decoder that exploits the local object features to learn their
correlation for further refining initial grounding proposals. Extensive
experiments on three challenging benchmarks (Nr3D, Sr3D, and ScanRefer) show
that our proposed 3DOGSFormer outperforms state-of-the-art 3D single-object
grounding methods and their dense-object variants by significant margins.
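
The abstract describes a two-stage, stacked decoding design: a contextual query-driven local transformer decoder first produces an initial grounding proposal for each described object, and a proposal-guided global transformer decoder then lets those proposals interact to refine one another. The snippet below is a minimal PyTorch sketch of that general idea only; the module names, feature dimensions, the diagonal attention mask used to keep the local stage per-sentence, and the 6-parameter box head are illustrative assumptions, not the authors' 3DOGSFormer implementation.

```python
# Minimal sketch of a stacked local/global decoding scheme (assumptions noted inline).
import torch
import torch.nn as nn


class LocalDecoder(nn.Module):
    """Per-sentence ("local") stage: each sentence-conditioned query attends to
    scene object features to produce an initial grounding proposal embedding."""

    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, sentence_queries: torch.Tensor, scene_feats: torch.Tensor) -> torch.Tensor:
        # sentence_queries: (B, num_sentences, d_model) -- one query per described object
        # scene_feats:      (B, num_objects, d_model)   -- 3D object/proposal features
        n = sentence_queries.size(1)
        # Block cross-query self-attention so each sentence is decoded independently
        # in this stage (an assumption to mirror the local-vs-global split).
        mask = torch.full((n, n), float("-inf"), device=sentence_queries.device)
        mask.fill_diagonal_(0.0)
        return self.decoder(tgt=sentence_queries, memory=scene_feats, tgt_mask=mask)


class GlobalDecoder(nn.Module):
    """Proposal-guided ("global") stage: initial proposals attend to each other and
    back to the scene features to jointly refine their localization."""

    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.box_head = nn.Linear(d_model, 6)  # (center xyz, size whl) -- assumed parameterization

    def forward(self, proposals: torch.Tensor, scene_feats: torch.Tensor) -> torch.Tensor:
        refined = self.decoder(tgt=proposals, memory=scene_feats)
        return self.box_head(refined)  # (B, num_sentences, 6) refined boxes


if __name__ == "__main__":
    B, num_sentences, num_objects, d = 2, 5, 32, 256
    sentence_queries = torch.randn(B, num_sentences, d)  # from a text encoder (assumed)
    scene_feats = torch.randn(B, num_objects, d)          # from a 3D detector (assumed)

    local, global_dec = LocalDecoder(d), GlobalDecoder(d)
    initial = local(sentence_queries, scene_feats)        # stage 1: independent proposals
    boxes = global_dec(initial, scene_feats)              # stage 2: jointly refined boxes
    print(boxes.shape)                                    # torch.Size([2, 5, 6])
```

In the paper the global stage additionally exploits the semantic and spatial relations among the densely referred objects; the sketch only shows the generic attention plumbing of the two stacked decoders.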
Related papers
- Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers [65.51132104404051]
We introduce the use of object identifiers and object-centric representations to interact with scenes at the object level.
Our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
arXiv Detail & Related papers (2023-12-13T14:27:45Z)
- Multi3DRefer: Grounding Text Description to Multiple 3D Objects [15.54885309441946]
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions.
Our dataset contains 61,926 descriptions of 11,609 objects, where each description references zero, one, or multiple target objects.
We develop a stronger baseline that leverages 2D features from CLIP by rendering object proposals online with contrastive learning; it outperforms the state of the art on the ScanRefer benchmark.
arXiv Detail & Related papers (2023-09-11T06:03:39Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- Looking Outside the Box to Ground Language in 3D Scenes [27.126171549887232]
We propose a model for grounding language in 3D scenes with three main innovations.
These include iterative attention across the language stream, the point-cloud feature stream, and 3D box proposals, as well as joint supervision from 3D object annotations and language grounding annotations.
When applied to language grounding on 2D images with minor changes, the model performs on par with the state of the art while converging in half the GPU time.
arXiv Detail & Related papers (2021-12-16T13:50:23Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for the 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)