A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions
- URL: http://arxiv.org/abs/2406.05785v2
- Date: Mon, 22 Jul 2024 03:21:27 GMT
- Authors: Daizong Liu, Yang Liu, Wencan Huang, Wei Hu
- Abstract summary: Text-guided 3D visual grounding (T-3DVG) aims to locate a specific object that semantically corresponds to a language query from a complicated 3D scene.
Compared to 2D visual grounding, this task presents great potential and challenges due to its closer proximity to the real world and the complexity of data collection and 3D point cloud source processing.
- Score: 27.469346807311574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided 3D visual grounding (T-3DVG), which aims to locate a specific object that semantically corresponds to a language query from a complicated 3D scene, has drawn increasing attention in the 3D research community over the past few years. Compared to 2D visual grounding, this task presents great potential and challenges due to its closer proximity to the real world and the complexity of data collection and 3D point cloud source processing. In this survey, we attempt to provide a comprehensive overview of the T-3DVG progress, including its fundamental elements, recent research advances, and future research directions. To the best of our knowledge, this is the first systematic survey on the T-3DVG task. Specifically, we first provide a general structure of the T-3DVG pipeline with detailed components in a tutorial style, presenting a complete background overview. Then, we summarize the existing T-3DVG approaches into different categories and analyze their strengths and weaknesses. We also present the benchmark datasets and evaluation metrics to assess their performances. Finally, we discuss the potential limitations of existing T-3DVG and share some insights on several promising research directions. The latest papers are continually collected at https://github.com/liudaizong/Awesome-3D-Visual-Grounding.
Related papers
- General Geometry-aware Weakly Supervised 3D Object Detection [62.26729317523975]
A unified framework is developed for learning 3D object detectors from RGB images and associated 2D boxes.
Experiments on KITTI and SUN-RGBD datasets demonstrate that our method yields surprisingly high-quality 3D bounding boxes with only 2D annotation.
(2024-07-18T17:52:08Z)
- A Comprehensive Survey on 3D Content Generation [148.434661725242]
3D content generation has both academic and practical value.
A new taxonomy is proposed that categorizes existing approaches into three types: 3D native generative methods, 2D prior-based 3D generative methods, and hybrid 3D generative methods.
(2024-02-02T06:20:44Z)
- Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
Existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries.
We propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) to align the semantics between texts and 2D images.
(2023-12-15T09:08:14Z)
- Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance [72.6809373191638]
We propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels.
Specifically, we design a feature-level constraint to align LiDAR and image features based on object-aware regions.
Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations.
Third, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data.
(2023-12-12T18:57:25Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
(2022-09-13T05:26:09Z)
- Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases [35.18565109770112]
The 3DPAG task aims to localize the target objects in a 3D scene by explicitly identifying all phrase-related objects and then reasoning over the contextual phrases.
By leveraging our datasets, we can extend previous 3DVG methods to the fine-grained phrase-aware scenario.
Results confirm significant improvements, i.e., the previous state-of-the-art method achieves 3.9%, 3.5%, and 4.6% overall accuracy gains on Nr3D, Sr3D, and ScanRefer, respectively.
(2022-07-05T05:50:12Z)
- 3D Object Detection for Autonomous Driving: A Survey [14.772968858398043]
3D object detection serves as the core basis of such a perception system.
Despite existing efforts, 3D object detection on point clouds is still in its infancy.
Recent state-of-the-art detection methods with their pros and cons are presented.
(2021-06-21T03:17:20Z)
- Deep Learning Based 3D Segmentation: A Survey [29.402585297221457]
3D segmentation is a fundamental problem in computer vision with applications in autonomous driving, robotics, augmented reality and medical image analysis.
Deep learning techniques have recently become the tool of choice for 3D segmentation tasks.
This paper fills the gap and provides a comprehensive survey of the recent progress made in deep learning based 3D segmentation.
(2021-03-09T13:58:35Z)
- PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding [107.02479689909164]
In this work, we aim at facilitating research on 3D representation learning.
We measure the effect of unsupervised pre-training on a large source set of 3D scenes.
(2020-07-21T17:59:22Z)
- Semantic Correspondence via 2D-3D-2D Cycle [58.023058561837686]
We propose a new method for predicting semantic correspondences by leveraging the 3D domain.
We show that our method gives comparable and even superior results on standard semantic benchmarks.
(2020-04-20T05:27:45Z)