EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual
Grounding
- URL: http://arxiv.org/abs/2209.14941v3
- Date: Mon, 24 Apr 2023 13:16:57 GMT
- Title: EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual
Grounding
- Authors: Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang
- Abstract summary: 3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues.
We present EDA that Explicitly Decouples the textual attributes in a sentence.
We further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity.
- Score: 4.447173454116189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual grounding aims to find the object within point clouds mentioned by
free-form natural language descriptions with rich semantic cues. However,
existing methods either extract the sentence-level features coupling all words
or focus more on object names, which would lose the word-level information or
neglect other attributes. To alleviate these issues, we present EDA that
Explicitly Decouples the textual attributes in a sentence and conducts Dense
Alignment between such fine-grained language and point cloud objects.
Specifically, we first propose a text decoupling module to produce textual
features for every semantic component. Then, we design two losses to supervise
the dense matching between two modalities: position alignment loss and semantic
alignment loss. On top of that, we further introduce a new visual grounding
task, locating objects without object names, which can thoroughly evaluate the
model's dense alignment capacity. Through experiments, we achieve
state-of-the-art performance on two widely-adopted 3D visual grounding
datasets, ScanRefer and SR3D/NR3D, and obtain absolute leadership on our
newly-proposed task. The source code is available at
https://github.com/yanmin-wu/EDA.
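The abstract describes a text decoupling module plus two dense-alignment losses (position alignment and semantic alignment) between decoupled text components and point-cloud objects. The snippet below is a minimal, hypothetical PyTorch sketch of how such component-to-object alignment losses could be written; it is not the authors' implementation (see the linked repository for that), and the function names, tensor shapes, and the cross-entropy/contrastive formulations are all assumptions made here for illustration.

```python
# Hypothetical sketch of dense text-object alignment losses (not EDA's code):
# (i) a position alignment loss that pushes each decoupled text component toward
#     its ground-truth object slot, and
# (ii) a semantic alignment loss that contrasts matched text/object feature pairs
#     against unmatched ones. Shapes and names are assumptions.
import torch
import torch.nn.functional as F


def position_alignment_loss(sim, gt_object_idx):
    """Cross-entropy over objects: each text component should select its GT object.

    sim:           (num_components, num_objects) similarity logits
    gt_object_idx: (num_components,) index of the ground-truth object per component
    """
    return F.cross_entropy(sim, gt_object_idx)


def semantic_alignment_loss(text_feat, obj_feat, gt_object_idx, tau=0.07):
    """InfoNCE-style contrast pulling matched text/object features together."""
    text_feat = F.normalize(text_feat, dim=-1)   # (C, D)
    obj_feat = F.normalize(obj_feat, dim=-1)     # (O, D)
    logits = text_feat @ obj_feat.t() / tau      # (C, O)
    return F.cross_entropy(logits, gt_object_idx)


if __name__ == "__main__":
    C, O, D = 4, 8, 256                          # components, objects, feature dim
    text_feat = torch.randn(C, D)
    obj_feat = torch.randn(O, D)
    sim = text_feat @ obj_feat.t()
    gt = torch.randint(0, O, (C,))
    loss = position_alignment_loss(sim, gt) + semantic_alignment_loss(text_feat, obj_feat, gt)
    print(loss.item())
```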
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a visual-language task that segments all points of the specified object from a 3D point cloud described by a query sentence.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which requires only efficient binary mask supervision.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner (see the illustrative sketch after this list).
arXiv Detail & Related papers (2023-07-18T13:49:49Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud [39.055928838826226]
3D object grounding aims to locate the most relevant target object in a raw point cloud scene based on a free-form language description.
First, we propose a language scene graph module to capture the rich structure and long-distance phrase correlations.
Second, we introduce a multi-level 3D proposal relation graph module to extract the object-object and object-scene co-occurrence relationships.
arXiv Detail & Related papers (2021-03-30T14:22:36Z)
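To make the coarse-to-fine proposal-sentence matching idea referenced above more concrete, here is a minimal, hypothetical PyTorch sketch. It is not taken from any of the listed papers; the function name, the feature shapes, and the top-k filtering with max-over-words aggregation are all assumptions used purely for illustration.

```python
# Speculative sketch of coarse-to-fine semantic matching between 3D object
# proposals and a sentence query (not any paper's implementation): a coarse pass
# ranks proposals by sentence-level similarity, and a fine pass re-scores the
# survivors with word-level similarities. All names and shapes are assumed.
import torch
import torch.nn.functional as F


def coarse_to_fine_match(prop_feat, sent_feat, word_feat, top_k=5):
    """
    prop_feat: (P, D) proposal features
    sent_feat: (D,)   sentence-level feature
    word_feat: (W, D) word-level features
    Returns the index of the best-matching proposal.
    """
    prop = F.normalize(prop_feat, dim=-1)
    # Coarse stage: sentence-level cosine similarity, keep the top-k proposals.
    coarse = prop @ F.normalize(sent_feat, dim=-1)                       # (P,)
    keep = coarse.topk(min(top_k, prop.size(0))).indices                 # (k,)
    # Fine stage: word-level similarity, aggregated by max over words.
    fine = (prop[keep] @ F.normalize(word_feat, dim=-1).t()).max(dim=-1).values  # (k,)
    return keep[fine.argmax()].item()


if __name__ == "__main__":
    P, W, D = 32, 12, 256
    best = coarse_to_fine_match(torch.randn(P, D), torch.randn(D), torch.randn(W, D))
    print("best proposal index:", best)
```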
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.