A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects
- URL: http://arxiv.org/abs/2506.19769v1
- Date: Tue, 24 Jun 2025 16:34:56 GMT
- Title: A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects
- Authors: Shulan Ruan, Rongwei Wang, Xuchen Shen, Huijie Liu, Baihui Xiao, Jun Shi, Kun Zhang, Zhenya Huang, Yu Liu, Enhong Chen, You He
- Abstract summary: Multi-sensor fusion perception (MSFP) is a key technology for embodied AI. Recent achievements in AI-based MSFP methods have been reviewed in relevant surveys.
- Score: 60.31285117477418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-sensor fusion perception (MSFP) is a key technology for embodied AI, which can serve a variety of downstream tasks (e.g., 3D object detection and semantic segmentation) and application scenarios (e.g., autonomous driving and swarm robotics). Recently, impressive achievements in AI-based MSFP methods have been reviewed in relevant surveys. However, after a rigorous and detailed investigation, we observe that existing surveys have some limitations. First, most surveys are oriented toward a single task or research field, such as 3D object detection or autonomous driving, so researchers working on other related tasks often find it difficult to benefit from them directly. Second, most surveys introduce MSFP only from the single perspective of multi-modal fusion, neglecting the diversity of MSFP methods, such as multi-view fusion and time-series fusion. To this end, in this paper we organize MSFP research from a task-agnostic perspective, reporting methods from various technical views. Specifically, we first introduce the background of MSFP. Next, we review multi-modal and multi-agent fusion methods, and then analyze time-series fusion methods. In the era of LLMs, we also investigate multimodal LLM fusion methods. Finally, we discuss open challenges and future directions for MSFP. We hope this survey can help researchers understand the important progress in MSFP and provide possible insights for future research.
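To make the multi-modal fusion pattern concrete, here is a minimal, hypothetical sketch of feature-level camera-LiDAR fusion feeding a dense perception head. The encoders, feature dimensions, and the concatenate-then-project fusion rule are illustrative assumptions rather than the survey's method, and camera features are assumed to be already projected into the same BEV grid as the LiDAR features.

```python
# Minimal sketch of feature-level camera + LiDAR fusion (illustrative only).
# Encoder designs, dimensions, and the concat-then-project fusion rule are
# assumptions for demonstration, not a method from the surveyed papers.
import torch
import torch.nn as nn

class SimpleFusionPerception(nn.Module):
    def __init__(self, cam_dim=256, lidar_dim=128, fused_dim=256, num_classes=10):
        super().__init__()
        # Per-modality encoders producing spatially aligned feature maps.
        self.cam_encoder = nn.Conv2d(3, cam_dim, kernel_size=3, padding=1)
        self.lidar_encoder = nn.Conv2d(1, lidar_dim, kernel_size=3, padding=1)
        # Fusion: channel-wise concatenation followed by a 1x1 projection.
        self.fuse = nn.Conv2d(cam_dim + lidar_dim, fused_dim, kernel_size=1)
        # Task head, e.g. per-cell class logits for detection/segmentation.
        self.head = nn.Conv2d(fused_dim, num_classes, kernel_size=1)

    def forward(self, cam_bev, lidar_bev):
        f_cam = self.cam_encoder(cam_bev)        # (B, cam_dim, H, W)
        f_lidar = self.lidar_encoder(lidar_bev)  # (B, lidar_dim, H, W)
        fused = torch.relu(self.fuse(torch.cat([f_cam, f_lidar], dim=1)))
        return self.head(fused)

model = SimpleFusionPerception()
logits = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 10, 64, 64])
```

The same skeleton extends to the other fusion axes the survey covers: multi-view fusion would concatenate features from multiple camera views, and time-series fusion would aggregate the fused maps across frames.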
Related papers
- Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [25.31489336119893]
We systematically review the applications of multimodal fusion in key robotic vision tasks.
We compare vision-language models (VLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies.
We identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation.
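Cross-modal alignment, the first challenge listed above, is commonly addressed with a CLIP-style contrastive objective. The sketch below shows one widely used, illustrative formulation (symmetric InfoNCE over a batch of paired embeddings); the temperature value and embedding size are assumptions.

```python
# Hedged sketch of contrastive cross-modal alignment (CLIP-style): pull
# paired image/text embeddings together, push mismatched pairs apart.
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))       # matched pairs on diagonal
    # Symmetric InfoNCE: image-to-text plus text-to-image cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```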
arXiv Detail & Related papers (2025-04-03T10:53:07Z)
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey [124.23247710880008]
Multimodal CoT (MCoT) reasoning has recently garnered significant research attention.
Existing MCoT studies design various methodologies to address the challenges of image, video, speech, audio, 3D, and structured data.
We present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions.
arXiv Detail & Related papers (2025-03-16T18:39:13Z)
- Survey on AI-Generated Media Detection: From Non-MLLM to MLLM [51.91311158085973]
Methods for detecting AI-generated media have evolved rapidly.
General-purpose detectors based on MLLMs integrate authenticity verification, explainability, and localization capabilities.
Ethical and security considerations have emerged as critical global concerns.
arXiv Detail & Related papers (2025-02-07T12:18:20Z)
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z)
- Large Multimodal Agents: A Survey [78.81459893884737]
Large language models (LLMs) have achieved superior performance in powering text-based AI agents.
There is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain.
This review aims to provide valuable insights and guidelines for future research in this rapidly evolving field.
arXiv Detail & Related papers (2024-02-23T06:04:23Z)
- Detecting Multimedia Generated by Large AI Models: A Survey [25.97663040910416]
The aim of this survey is to fill an academic gap and contribute to global AI security efforts.
We introduce a novel taxonomy for detection methods, categorized by media modality.
We offer a focused analysis from a social media perspective to highlight their broader societal impact.
arXiv Detail & Related papers (2024-01-22T15:08:19Z)
- MMDR: A Result Feature Fusion Object Detection Approach for Autonomous System [5.499393552545591]
The proposed approach, called Multi-Modal Detector based on Result features (MMDR), is designed to work for both 2D and 3D object detection tasks.
The MMDR model incorporates shallow global features during the feature fusion stage, endowing the model with the ability to perceive background information.
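As a rough illustration of result-level fusion in the spirit of MMDR (the paper's exact architecture is not reproduced here), the hypothetical sketch below fuses per-box result features from 2D and 3D detection branches with a shallow global feature before rescoring each candidate; all shapes and the rescoring MLP are assumptions.

```python
# Hedged sketch of result-feature fusion: per-box features from two
# detectors plus a shallow global (background) feature drive a rescoring
# MLP. Dimensions and the network design are illustrative assumptions.
import torch
import torch.nn as nn

class ResultFeatureFusion(nn.Module):
    def __init__(self, box_dim=64, global_dim=32):
        super().__init__()
        self.rescore = nn.Sequential(
            nn.Linear(2 * box_dim + global_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # refined confidence per candidate box
        )

    def forward(self, feat_2d, feat_3d, global_feat):
        # Broadcast the shallow global feature to every candidate box.
        g = global_feat.expand(feat_2d.size(0), -1)
        fused = torch.cat([feat_2d, feat_3d, g], dim=-1)
        return torch.sigmoid(self.rescore(fused)).squeeze(-1)

fuser = ResultFeatureFusion()
scores = fuser(torch.randn(5, 64), torch.randn(5, 64), torch.randn(1, 32))
print(scores.shape)  # torch.Size([5])
```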
arXiv Detail & Related papers (2023-04-19T12:28:42Z)
- Recent Advances in Embedding Methods for Multi-Object Tracking: A Survey [71.10448142010422]
Multi-object tracking (MOT) aims to associate target objects across video frames in order to obtain entire moving trajectories.
Embedding methods play an essential role in object location estimation and temporal identity association in MOT.
We first conduct a comprehensive overview with in-depth analysis of embedding methods in MOT from seven different perspectives.
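A central use of embeddings in MOT is temporal identity association. The illustrative sketch below matches current-frame detections to existing tracks by cosine similarity of appearance embeddings via the Hungarian algorithm; the similarity threshold and embedding dimension are assumptions, and real trackers typically combine this with motion cues.

```python
# Illustrative embedding-based association for MOT: assign detections to
# tracks by maximizing total cosine similarity (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embs, det_embs, sim_threshold=0.2):
    # Normalize rows so dot products become cosine similarities.
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    sim = t @ d.T                             # (num_tracks, num_dets)
    rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
    # Keep confident matches; the rest spawn new tracks or end old ones.
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_threshold]

matches = associate(np.random.randn(3, 128), np.random.randn(4, 128))
print(matches)  # pairs below the similarity threshold are dropped
```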
arXiv Detail & Related papers (2022-05-22T06:54:33Z)
- Multi-modal Sensor Fusion for Auto Driving Perception: A Survey [29.804411344922382]
We provide a literature review of the existing multi-modal-based methods for perception tasks in autonomous driving.
We propose a more reasonable taxonomy from the view of the fusion stage, dividing these methods into two major classes and four minor classes.
arXiv Detail & Related papers (2022-02-06T04:18:45Z)
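To illustrate what a fusion-stage taxonomy distinguishes, the toy sketch below contrasts early fusion (combining modality features before any task head) with late fusion (combining per-modality predictions); the function names, shapes, and weights are illustrative assumptions, not the survey's classes.

```python
# Toy contrast of fusion stages: early fusion merges features before
# inference, late fusion merges per-modality predictions afterwards.
import numpy as np

def early_fusion(cam_feat, lidar_feat):
    # Fuse before the task head: concatenate per-cell features.
    return np.concatenate([cam_feat, lidar_feat], axis=-1)

def late_fusion(cam_scores, lidar_scores, w_cam=0.6, w_lidar=0.4):
    # Fuse after per-modality inference: weighted average of class scores.
    return w_cam * cam_scores + w_lidar * lidar_scores

fused_feat = early_fusion(np.random.rand(100, 32), np.random.rand(100, 16))
fused_pred = late_fusion(np.random.rand(100, 10), np.random.rand(100, 10))
print(fused_feat.shape, fused_pred.shape)  # (100, 48) (100, 10)
```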