Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive
Survey and Evaluation
- URL: http://arxiv.org/abs/2310.15676v1
- Date: Tue, 24 Oct 2023 09:39:05 GMT
- Title: Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive
Survey and Evaluation
- Authors: Yinjie Lei, Zixuan Wang, Feng Chen, Guoqing Wang, Peng Wang and Yang
Yang
- Abstract summary: Multi-modal 3D scene understanding has gained considerable attention due to its wide applications in many areas, such as autonomous driving and human-computer interaction.
Introducing an additional modality not only elevates the richness and precision of scene interpretation but also ensures a more robust and resilient understanding.
We present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations.
- Score: 28.417029383793068
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.
Related papers
- Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We bridge the research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes.
The task produces 3D segmentation masks and detailed textual explanations enriched with 3D spatial relations among objects.
In addition, we design MORE3D, a simple yet effective method that enables multi-object 3D reasoning segmentation with user questions and textual outputs.
arXiv Detail & Related papers (2024-11-21T08:22:45Z)
- Multi-modal Situated Reasoning in 3D Scenes [32.800524889357305]
We propose Multi-modal Situated Question Answering (MSQA), a large-scale multi-modal situated reasoning dataset.
MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios within 3D scenes.
We also devise the Multi-modal Situated Next-step Navigation (MSNN) benchmark to evaluate models' situated reasoning for navigation.
arXiv Detail & Related papers (2024-09-04T02:37:38Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves state-of-the-art performance across various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z)
- A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes [80.20670062509723]
3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes.
It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning.
Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
arXiv Detail & Related papers (2024-03-12T10:04:08Z)
- M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts [30.571811801090224]
We introduce a comprehensive 3D instruction-following dataset called M3DBench.
It supports general multi-modal instructions interleaved with text, images, 3D objects, and other visual prompts.
It unifies diverse 3D tasks at both region and scene levels, covering a variety of fundamental abilities in real-world 3D environments.
arXiv Detail & Related papers (2023-12-17T16:53:30Z)
- 3D Multiple Object Tracking on Autonomous Driving: A Literature Review [25.568952977339]
3D multi-object tracking (3D MOT) is a pivotal task within autonomous driving.
Despite its significance, 3D MOT still confronts numerous formidable challenges.
arXiv Detail & Related papers (2023-09-27T05:32:26Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology that can enable autonomous vehicles to perceive and understand the subtle and complex behaviors of pedestrians.
Our method efficiently makes use of complementary image and LiDAR signals in a semi-supervised fashion and outperforms existing methods by a large margin.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages (a minimal illustrative sketch follows below).
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
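To make the fusion pattern concrete, here is a minimal PyTorch sketch of pixel-aligned LiDAR/image feature fusion followed by Transformer refinement, in the spirit of the HUM3DIL summary above. It is only an illustration under stated assumptions (single pinhole camera, nearest-neighbour feature sampling, invented module and parameter names), not the authors' implementation.

```python
# Illustrative sketch of pixel-aligned LiDAR/image fusion + Transformer
# refinement. All names, shapes, and the projection model are assumptions
# made for this example, not HUM3DIL's actual code.
import torch
import torch.nn as nn

def project_to_pixels(points, K):
    """Project (N, 3) camera-frame LiDAR points to (N, 2) pixel coordinates
    using a pinhole intrinsic matrix K of shape (3, 3)."""
    uvw = points @ K.T                                 # (N, 3)
    return uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)     # divide by depth

class PixelAlignedFusion(nn.Module):
    def __init__(self, img_dim=64, d_model=128, n_layers=2):
        super().__init__()
        # Lift (xyz + sampled image feature) into a shared token space.
        self.point_proj = nn.Linear(3 + img_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.refine = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 3)  # e.g. per-point 3D regression

    def forward(self, points, img_feat, K):
        # points: (N, 3) LiDAR points; img_feat: (C, H, W) image feature map.
        _, H, W = img_feat.shape
        uv = project_to_pixels(points, K)
        # Sample the image feature at each projected pixel (nearest neighbour
        # here; bilinear sampling via F.grid_sample is the usual choice).
        u = uv[:, 0].round().long().clamp(0, W - 1)
        v = uv[:, 1].round().long().clamp(0, H - 1)
        pix = img_feat[:, v, u].T                       # (N, C)
        tokens = self.point_proj(torch.cat([points, pix], dim=-1))
        refined = self.refine(tokens.unsqueeze(0))      # (1, N, d_model)
        return self.head(refined.squeeze(0))            # (N, 3)

# Toy usage: 256 random points in front of a 640x480 camera feature map.
model = PixelAlignedFusion()
pts = torch.rand(256, 3) * torch.tensor([2.0, 2.0, 5.0]) + torch.tensor([-1.0, -1.0, 1.0])
K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
feat = torch.randn(64, 480, 640)
out = model(pts, feat, K)  # (256, 3)
```

In a real multi-camera autonomous-driving setup, the nearest-neighbour lookup would typically be replaced by bilinear sampling (torch.nn.functional.grid_sample) and the single pinhole model by per-camera extrinsics and intrinsics.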
- Recent Advances in Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective [69.44384540002358]
We provide a comprehensive and holistic 2D-to-3D perspective to tackle this problem.
We categorize the mainstream and milestone approaches since the year 2014 under unified frameworks.
We also summarize the pose representation styles, benchmarks, evaluation metrics, and the quantitative performance of popular approaches.
arXiv Detail & Related papers (2021-04-23T11:07:07Z)