Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive
Survey and Evaluation
- URL: http://arxiv.org/abs/2310.15676v1
- Date: Tue, 24 Oct 2023 09:39:05 GMT
- Title: Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive
Survey and Evaluation
- Authors: Yinjie Lei, Zixuan Wang, Feng Chen, Guoqing Wang, Peng Wang and Yang
Yang
- Abstract summary: Multi-modal 3D scene understanding has gained considerable attention due to its wide applications in many areas, such as autonomous driving and human-computer interaction.
Introducing an additional modality not only elevates the richness and precision of scene interpretation but also ensures a more robust and resilient understanding.
We present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations.
- Score: 28.417029383793068
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.
Related papers
- Multi-modal Situated Reasoning in 3D Scenes [32.800524889357305]
We propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset.
MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes; a hypothetical record layout is sketched after this entry.
We also devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' situated reasoning for navigation.
arXiv Detail & Related papers (2024-09-04T02:37:38Z)
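A situated QA sample pairs a question with the agent's situation in the scene. As a rough illustration of what such a record could contain (all field names and values here are hypothetical assumptions, not MSQA's released schema):

```python
# Hypothetical shape of one situated QA record; field names and values are
# assumptions for illustration, not the released MSQA schema.
from dataclasses import dataclass

@dataclass
class Situation:
    position: tuple        # (x, y, z) agent location in the scene
    orientation: float     # heading angle in radians
    description: str       # textual situation statement

@dataclass
class SituatedQARecord:
    scene_id: str
    situation: Situation
    question: str
    answer: str
    category: str          # one of the 9 question categories

sample = SituatedQARecord(
    scene_id="scene0000_00",
    situation=Situation(position=(1.2, 0.4, 0.0),
                        orientation=1.57,
                        description="standing by the table, facing the sofa"),
    question="What object is directly to my left?",
    answer="a floor lamp",
    category="spatial-relation",
)
```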
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the first and largest multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves state-of-the-art performance across various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z)
- A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes [80.20670062509723]
3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes.
It presents significant potential and challenges because it represents the real world more closely than 2D visual captioning does.
Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
arXiv Detail & Related papers (2024-03-12T10:04:08Z)
- M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts [30.571811801090224]
We introduce a comprehensive 3D instruction-following dataset called M3DBench.
It supports general multi-modal instructions interleaved with text, images, 3D objects, and other visual prompts.
It unifies diverse 3D tasks at both the region and scene levels, covering a variety of fundamental abilities in real-world 3D environments (a sketch of one such interleaved instruction follows this entry).
arXiv Detail & Related papers (2023-12-17T16:53:30Z)
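Interleaving text with visual and 3D prompts suggests a sequence-of-segments representation. The following is a minimal sketch under that assumption; the segment types and names are illustrative, not M3DBench's actual format:

```python
# Hypothetical representation of one interleaved multi-modal instruction;
# types and field names are illustrative assumptions, not M3DBench's format.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextSegment:
    text: str

@dataclass
class ImageSegment:
    path: str              # reference to an image prompt

@dataclass
class Object3DSegment:
    object_id: str         # reference to a 3D object in the scene

Segment = Union[TextSegment, ImageSegment, Object3DSegment]

# "Describe <3D object> and compare it with <image>."
instruction: List[Segment] = [
    TextSegment("Describe "),
    Object3DSegment("chair_07"),
    TextSegment(" and compare it with "),
    ImageSegment("prompts/ref_chair.png"),
    TextSegment("."),
]
```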
- 3D Multiple Object Tracking on Autonomous Driving: A Literature Review [25.568952977339]
3D multi-object tracking (3D MOT) stands as a pivotal domain within autonomous driving.
Despite its importance, 3D MOT still faces numerous formidable challenges.
arXiv Detail & Related papers (2023-09-27T05:32:26Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology that can enable autonomous vehicles to perceive and understand the subtle and complex behaviors of pedestrians.
Our method efficiently makes use of complementary image and LiDAR signals in a semi-supervised fashion and outperforms existing methods by a large margin.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages (a minimal sketch of this idea follows the entry).
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
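The pixel-aligned embedding step described above could look roughly like this in PyTorch; every name, dimension, and design choice here is an assumption for illustration, not the authors' implementation:

```python
# Hedged sketch of pixel-aligned LiDAR embedding with Transformer refinement.
# Class and argument names are illustrative assumptions, not HUM3DIL's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedLiDAREncoder(nn.Module):
    def __init__(self, img_feat_dim=256, d_model=256, num_stages=4):
        super().__init__()
        # Lift raw (x, y, z) LiDAR coordinates into the model dimension.
        self.point_proj = nn.Linear(3, d_model)
        # Fuse the sampled image feature with the point embedding.
        self.fuse = nn.Linear(d_model + img_feat_dim, d_model)
        # "Sequence of Transformer refinement stages" from the abstract.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.stages = nn.TransformerEncoder(layer, num_layers=num_stages)

    def forward(self, points, img_feats, uv):
        # points:    (B, P, 3)   LiDAR points
        # img_feats: (B, C, H, W) image feature map from a 2D backbone
        # uv:        (B, P, 2)   point projections onto the image, in [-1, 1]
        pix = F.grid_sample(img_feats, uv.unsqueeze(1), align_corners=False)
        pix = pix.squeeze(2).transpose(1, 2)           # (B, P, C)
        tokens = self.fuse(torch.cat([self.point_proj(points), pix], dim=-1))
        return self.stages(tokens)                     # refined per-point features
```

The sketch projects each LiDAR point onto the image feature map with bilinear sampling, fuses the sampled pixel feature with a learned point embedding, and refines the fused tokens with a stack of Transformer encoder layers.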
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and explicitly integrates useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for 3D object counting (a rough sketch of the contextual refinement idea follows this entry).
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
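One plausible reading of multi-stage contextual refinement is a loop in which object candidates repeatedly attend to scene-level features; the sketch below follows that reading, with all module and argument names assumed rather than taken from CMR3D:

```python
# Rough sketch of a multi-stage contextual refinement loop; module and
# argument names are assumptions, not the CMR3D implementation.
import torch.nn as nn

class ContextRefinementStage(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        # Each candidate object queries the scene-level features for context.
        self.context_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.update = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                    nn.Linear(d_model, d_model))

    def forward(self, object_feats, scene_feats):
        # object_feats: (B, N, d)  features of N object candidates
        # scene_feats:  (B, M, d)  features of M scene elements (points/voxels)
        ctx, _ = self.context_attn(object_feats, scene_feats, scene_feats)
        return object_feats + self.update(ctx)   # residual refinement

def refine(object_feats, scene_feats, stages):
    # Apply the stages sequentially, each sharpening the candidates further.
    for stage in stages:
        object_feats = stage(object_feats, scene_feats)
    return object_feats
```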
- Recent Advances in Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective [69.44384540002358]
We provide a comprehensive and holistic 2D-to-3D perspective to tackle this problem.
We categorize the mainstream and milestone approaches since the year 2014 under unified frameworks.
We also summarize the pose representation styles, benchmarks, evaluation metrics, and the quantitative performance of popular approaches.
arXiv Detail & Related papers (2021-04-23T11:07:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.