A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing
Objects in 3D Scenes
- URL: http://arxiv.org/abs/2403.07469v1
- Date: Tue, 12 Mar 2024 10:04:08 GMT
- Title: A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing
Objects in 3D Scenes
- Authors: Ting Yu, Xiaojun Lin, Shuhui Wang, Weiguo Sheng, Qingming Huang, Jun
Yu
- Abstract summary: 3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes.
It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning.
Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
- Score: 80.20670062509723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Three-Dimensional (3D) dense captioning is an emerging vision-language
bridging task that aims to generate multiple detailed and accurate descriptions
for 3D scenes. It presents significant potential and challenges due to its
closer representation of the real world compared to 2D visual captioning, as
well as complexities in data collection and processing of 3D point cloud
sources. Despite the popularity and success of existing methods, there is a
lack of comprehensive surveys summarizing the advancements in this field, which
hinders its progress. In this paper, we provide a comprehensive review of 3D
dense captioning, covering task definition, architecture classification,
dataset analysis, evaluation metrics, and an in-depth discussion of the field's prospects.
Based on a synthesis of previous literature, we refine a standard pipeline that
serves as a common paradigm for existing methods. We also introduce a clear
taxonomy of existing models, summarize technologies involved in different
modules, and conduct detailed experimental analysis. Rather than introducing
methods in chronological order, we categorize them into distinct classes to
facilitate exploration and analysis of the differences and connections among
existing techniques. We also provide a reading guideline to assist readers with
different backgrounds and purposes in reading efficiently. Furthermore, we
propose a series of promising future directions for 3D dense captioning by
identifying challenges and aligning them with the development of related tasks,
offering valuable insights and inspiring future research in this field. Our aim
is to provide a comprehensive understanding of 3D dense captioning, foster
further investigations, and contribute to the development of novel applications
in multimedia and related domains.
Related papers
- 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance [68.8825501902835]
3DSS-VLG is a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance.
To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels.
arXiv Detail & Related papers (2024-07-13T09:39:11Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models [113.18524940863841]
This survey provides a comprehensive overview of the methodologies enabling large language models to process, understand, and generate 3D data.
Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs).
It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue.
arXiv Detail & Related papers (2024-05-16T16:59:58Z)
- Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation [28.417029383793068]
Multi-modal 3D scene understanding has gained considerable attention due to its wide applications in many areas, such as autonomous driving and human-computer interaction.
Introducing an additional modality not only elevates the richness and precision of scene interpretation but also ensures a more robust and resilient understanding.
We present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations.
arXiv Detail & Related papers (2023-10-24T09:39:05Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- 3D objects and scenes classification, recognition, segmentation, and reconstruction using 3D point cloud data: A review [5.85206759397617]
Three-dimensional (3D) point cloud analysis has become one of the most attractive subjects in realistic imaging and machine vision.
A significant effort has recently been devoted to developing novel strategies, using different techniques such as deep learning models.
Various tasks performed on 3D point cloud data are investigated, including object and scene detection, recognition, segmentation, and reconstruction.
arXiv Detail & Related papers (2023-06-09T15:45:23Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.