Toward Explainable and Fine-Grained 3D Grounding through Referring
Textual Phrases
- URL: http://arxiv.org/abs/2207.01821v2
- Date: Sat, 27 May 2023 10:03:34 GMT
- Title: Toward Explainable and Fine-Grained 3D Grounding through Referring
Textual Phrases
- Authors: Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li
- Abstract summary: The 3DPAG task aims to localize target objects in a 3D scene by explicitly identifying all phrase-related objects and then reasoning over the contextual phrases.
Using our datasets, previous 3DVG methods can be extended to the fine-grained phrase-aware scenario.
Results confirm significant improvements: the previous state-of-the-art method achieves 3.9%, 3.5%, and 4.6% overall accuracy gains on Nr3D, Sr3D, and ScanRefer, respectively.
- Score: 35.18565109770112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in 3D scene understanding has explored visual grounding
(3DVG) to localize a target object through a language description. However,
existing methods consider only the dependency between the entire sentence and
the target object, ignoring fine-grained relationships between contextual
phrases and non-target objects. In this paper, we extend 3DVG to a more fine-grained and
interpretable task, called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task
aims to localize the target objects in a 3D scene by explicitly identifying all
phrase-related objects and then reasoning over the contextual phrases. To
tackle this problem, we manually labeled about 227K phrase-level annotations
from 88K sentences in the widely used 3DVG datasets Nr3D, Sr3D, and ScanRefer,
using a self-developed platform. With these datasets, previous 3DVG methods can
be extended to the fine-grained phrase-aware scenario. This is achieved through
the proposed phrase-object alignment optimization and phrase-specific
pre-training, which also boost conventional 3DVG performance. Extensive results
confirm significant improvements: the previous state-of-the-art method achieves
3.9%, 3.5%, and 4.6% overall accuracy gains on Nr3D, Sr3D, and ScanRefer,
respectively.
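The abstract names a phrase-object alignment optimization but does not spell it out. Below is a minimal sketch of one plausible form: a contrastive loss that makes each phrase embedding score highest on its annotated object proposal. The function name, tensor shapes, and temperature are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a phrase-object alignment loss (not the paper's code).
# Each phrase should score highest on the object proposal it refers to.
import torch
import torch.nn.functional as F

def phrase_object_alignment_loss(phrase_emb, object_emb, target_idx, tau=0.07):
    """phrase_emb: (P, D) phrase features; object_emb: (O, D) proposal features;
    target_idx: (P,) index of the annotated object for each phrase."""
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    object_emb = F.normalize(object_emb, dim=-1)
    logits = phrase_emb @ object_emb.t() / tau   # (P, O) similarity scores
    return F.cross_entropy(logits, target_idx)   # softmax over objects per phrase

# Toy usage: 3 phrases, 5 object proposals, 64-d features.
loss = phrase_object_alignment_loss(torch.randn(3, 64), torch.randn(5, 64),
                                    torch.tensor([0, 2, 4]))
```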
Related papers
- A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions [27.469346807311574]
Text-guided 3D visual grounding (T-3DVG) aims to locate a specific object that semantically corresponds to a language query from a complicated 3D scene.
Compared to 2D visual grounding, this task presents greater potential and challenges owing to its closer proximity to the real world and the complexity of collecting and processing 3D point cloud data.
arXiv Detail & Related papers (2024-06-09T13:52:12Z) - Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity when translating 3D vision tasks into language formats.
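One common way to realize such referent tokens is to register them as special tokens in the tokenizer and grow the model's embedding table. The sketch below assumes a generic Hugging Face causal LM; "gpt2" is a stand-in checkpoint and the <obj_i> token format is invented for illustration, not the paper's scheme.

```python
# Hedged sketch: adding scene referent tokens to an LLM vocabulary.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Reserve special tokens that act as noun phrases referring to scene objects.
referent_tokens = [f"<obj_{i}>" for i in range(8)]
tokenizer.add_special_tokens({"additional_special_tokens": referent_tokens})
model.resize_token_embeddings(len(tokenizer))  # grow the embedding table

# A per-task template can then mention objects by token, e.g.:
prompt = "Locate the chair next to <obj_3> and describe it."
print(tokenizer.tokenize(prompt))  # the referent token stays a single unit
```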
arXiv Detail & Related papers (2024-05-16T18:03:41Z) - Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models to align the semantics between text and 2D images.
During inference, the learned text-3D correspondence helps ground text queries to 3D target objects even without 2D images.
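A minimal sketch of this train/infer split, assuming precomputed features from a frozen vision-language model; module names and dimensions are illustrative, not the 3D-VLA architecture.

```python
# Sketch: align 3D object features with frozen 2D VLM features at training
# time, then match text embeddings to 3D features directly at inference.
import torch
import torch.nn.functional as F

proj_3d = torch.nn.Linear(256, 512)  # lifts 3D proposal features into VLM space

def train_step(feat_3d, feat_2d_frozen):
    """feat_3d: (N, 256) 3D proposal features; feat_2d_frozen: (N, 512)
    matching image-crop features from a frozen vision-language model."""
    z3d = F.normalize(proj_3d(feat_3d), dim=-1)
    z2d = F.normalize(feat_2d_frozen, dim=-1)
    return 1 - (z3d * z2d).sum(-1).mean()  # cosine distillation loss

def ground(feat_3d, text_emb):
    """text_emb: (512,) frozen VLM text embedding. Returns the index of the
    best-matching 3D proposal; no 2D image is needed at inference."""
    z3d = F.normalize(proj_3d(feat_3d), dim=-1)
    return (z3d @ F.normalize(text_emb, dim=0)).argmax()
```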
arXiv Detail & Related papers (2023-12-15T09:08:14Z) - Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance [72.6809373191638]
We propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels.
First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions.
Second, an output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations.
Third, a training-level constraint is applied by producing accurate and consistent 3D pseudo-labels that align with the visual data.
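Of the three constraints, the output-level one is the most mechanical; a sketch under simplifying assumptions (pinhole camera, axis-aligned 2D boxes; not the paper's code) is below.

```python
# Sketch of the output-level constraint: project a 3D box into the image
# and measure its overlap with a 2D detection.
import numpy as np

def project_box(corners_3d, K):
    """corners_3d: (8, 3) box corners in camera coords; K: (3, 3) intrinsics.
    Returns the tight 2D bounding box [x1, y1, x2, y2] of the projection."""
    uvw = corners_3d @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]            # perspective divide
    return np.array([uv[:, 0].min(), uv[:, 1].min(),
                     uv[:, 0].max(), uv[:, 1].max()])

def iou_2d(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

# The constraint penalizes low overlap, e.g.
# loss = 1 - iou_2d(project_box(corners, K), box_2d)
```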
arXiv Detail & Related papers (2023-12-12T18:57:25Z) - Object2Scene: Putting Objects in Context for Open-Vocabulary 3D
Detection [24.871590175483096]
Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set.
Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics.
We propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection.
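The augmentation idea can be pictured with a toy function that drops an isolated object's point cloud into a scene cloud; collision checks and label bookkeeping are omitted, and all names are assumptions rather than the paper's pipeline.

```python
# Illustrative sketch of scene augmentation with an inserted object.
import numpy as np

def insert_object(scene_pts, obj_pts, floor_z=0.0, rng=np.random.default_rng()):
    """scene_pts: (N, 3); obj_pts: (M, 3) centered object. Returns merged cloud."""
    xy = rng.uniform(scene_pts[:, :2].min(0), scene_pts[:, :2].max(0))
    shift = np.array([xy[0], xy[1], floor_z - obj_pts[:, 2].min()])
    return np.vstack([scene_pts, obj_pts + shift])  # object rests on the floor
```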
arXiv Detail & Related papers (2023-09-18T03:31:53Z) - 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
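Relative-position awareness typically starts from pairwise offsets between object centers; a toy encoding is sketched below (feature dimensions and the MLP are illustrative, not the 3DRP-Net design).

```python
# Sketch: encode pairwise center offsets and distances as relation features.
import torch

centers = torch.randn(6, 3)                       # 6 proposal centers
rel = centers[:, None, :] - centers[None, :, :]   # (6, 6, 3) pairwise offsets
dist = rel.norm(dim=-1, keepdim=True)             # (6, 6, 1) distances
rel_feat = torch.cat([rel, dist], dim=-1)         # (6, 6, 4)

mlp = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, 32))
relation_bias = mlp(rel_feat)                     # relation feature per pair
```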
arXiv Detail & Related papers (2023-07-25T09:33:25Z) - Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly
Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
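A toy version of coarse-to-fine matching: filter proposals by similarity to the query's object-class word, then rerank the survivors against the whole sentence. Both similarity computations are stand-ins, not the paper's distilled matching model.

```python
# Sketch of two-stage (coarse-to-fine) proposal-sentence matching.
import torch
import torch.nn.functional as F

def match(prop_feat, class_emb, sent_emb, keep=5):
    """prop_feat: (O, D) proposals; class_emb: (D,) the query's object-class
    word; sent_emb: (D,) the whole sentence. Returns the best proposal index."""
    p = F.normalize(prop_feat, dim=-1)
    coarse = p @ F.normalize(class_emb, dim=0)      # (O,) word-level scores
    cand = coarse.topk(min(keep, len(coarse))).indices
    fine = p[cand] @ F.normalize(sent_emb, dim=0)   # rerank the candidates
    return cand[fine.argmax()]
```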
arXiv Detail & Related papers (2023-07-18T13:49:49Z) - ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved
Visio-Linguistic Models in 3D Scenes [48.65360357173095]
The Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences for 369k objects across 84k natural referential sentences.
We show that by incorporating intuitive losses that enable learning from this novel dataset, we can significantly improve the performance of several recently introduced neural listening architectures.
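The correspondences such a dataset provides can be pictured as one record per sentence; a hypothetical schema follows (field names are guesses for illustration, not the released format).

```python
# Hypothetical record type for phrase-to-object correspondence annotations.
from dataclasses import dataclass

@dataclass
class ReferentialSentence:
    scan_id: str            # scene the sentence refers to
    text: str               # natural referential sentence
    target_object: int      # instance id of the referred target
    phrase_to_object: dict  # char span (start, end) -> instance id

ex = ReferentialSentence("scene0000_00", "the chair near the window",
                         3, {(4, 9): 3, (19, 25): 7})
```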
arXiv Detail & Related papers (2022-12-12T21:25:58Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
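One way to picture a contextual refinement stage: object queries attend to scene features and are updated residually, stage by stage. The layer sizes, stage count, and shared attention module are assumptions for illustration, not the CMR3D architecture.

```python
# Sketch of multi-stage contextual refinement via cross-attention.
import torch

attn = torch.nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

def refine(obj_queries, scene_feat, stages=2):
    """obj_queries: (B, Q, 128) object features; scene_feat: (B, S, 128)."""
    for _ in range(stages):
        ctx, _ = attn(obj_queries, scene_feat, scene_feat)  # gather context
        obj_queries = obj_queries + ctx                     # residual update
    return obj_queries

out = refine(torch.randn(1, 16, 128), torch.randn(1, 1024, 128))
```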
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds.
We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
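A toy rendering of the sequence view: each object is a short sequence of attribute "words" (e.g. position, size, class) decoded greedily, each conditioned on the previous. The decoder and vocabulary below are stand-ins, not Point2Seq's heads.

```python
# Sketch of auto-regressive decoding of one object's attribute words.
import torch

vocab, dim = 256, 64
embed = torch.nn.Embedding(vocab, dim)
rnn = torch.nn.GRU(dim, dim, batch_first=True)
head = torch.nn.Linear(dim, vocab)

def decode_object(seq_len=5, start_token=0):
    tokens, h = [start_token], None
    for _ in range(seq_len):
        x = embed(torch.tensor([[tokens[-1]]]))        # feed previous word
        out, h = rnn(x, h)
        tokens.append(int(head(out[:, -1]).argmax()))  # greedy next word
    return tokens[1:]  # decoded attribute words for one object

print(decode_object())
```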
arXiv Detail & Related papers (2022-03-25T00:20:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.