MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving
- URL: http://arxiv.org/abs/2303.08600v1
- Date: Wed, 15 Mar 2023 13:13:03 GMT
- Title: MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving
- Authors: Jiale Li, Hang Dai, Hao Han, Yong Ding
- Abstract summary: We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion.
MSeg3D still shows robustness and improves the LiDAR-only baseline.
- Score: 15.36416000750147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR and camera are two modalities available for 3D semantic segmentation in
autonomous driving. Popular LiDAR-only methods suffer from inferior
segmentation of small and distant objects due to insufficient laser points,
while robust multi-modal solutions remain under-explored. We investigate three
crucial inherent difficulties: modality heterogeneity, limited intersection of
the sensor fields of view, and multi-modal data augmentation.
We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint
intra-modal feature extraction and inter-modal feature fusion to mitigate the
modality heterogeneity. The multi-modal fusion in MSeg3D consists of
geometry-based feature fusion (GF-Phase), cross-modal feature completion, and
semantic-based feature fusion (SF-Phase) on all visible points. The multi-modal
data augmentation is reinvigorated by applying asymmetric transformations to the
LiDAR point cloud and the multi-camera images individually, which benefits model
training with more diverse augmentation transformations. MSeg3D achieves
state-of-the-art results on the nuScenes, Waymo, and SemanticKITTI datasets.
Under malfunctioning multi-camera input and multi-frame point cloud input,
MSeg3D remains robust and still improves over the LiDAR-only baseline. Our code is
publicly available at \url{https://github.com/jialeli1/lidarseg3d}.
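The GF-Phase described above relies on projecting LiDAR points into the camera images and gathering image features at the projected pixels. The following is a minimal PyTorch sketch of that geometry-based point-to-pixel fusion step, assuming standard pinhole calibration matrices; all function and tensor names are placeholders, and this is an illustration of the general operation rather than the released implementation linked above.

```python
# Minimal sketch (assumed names/shapes, not the authors' code) of geometry-based
# point-to-pixel fusion: project LiDAR points into a camera image, sample image
# features at the projected locations, and concatenate them with point features.

import torch
import torch.nn.functional as F


def project_points_to_image(points_xyz, lidar2cam, cam_intrinsics, image_hw):
    """Project LiDAR points (N, 3) into pixel coordinates.

    lidar2cam: (4, 4) extrinsic matrix; cam_intrinsics: (3, 3) intrinsic matrix.
    Returns pixel coords (N, 2) and a mask of points visible in the image.
    """
    n = points_xyz.shape[0]
    homo = torch.cat([points_xyz, points_xyz.new_ones(n, 1)], dim=1)   # (N, 4)
    pts_cam = (lidar2cam @ homo.T).T[:, :3]                            # (N, 3)
    in_front = pts_cam[:, 2] > 1e-3                                    # points behind the camera are invisible
    uvw = (cam_intrinsics @ pts_cam.T).T                               # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-3)                      # perspective division
    h, w = image_hw
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & in_image


def gather_image_features(image_feat, uv, image_hw):
    """Bilinearly sample per-point image features from a (C, Hf, Wf) feature map."""
    h, w = image_hw
    # normalize pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,
                        uv[:, 1] / (h - 1) * 2 - 1], dim=1).view(1, 1, -1, 2)
    sampled = F.grid_sample(image_feat.unsqueeze(0), grid, align_corners=True)
    return sampled.squeeze(0).squeeze(1).T                             # (N, C)


def geometry_based_fusion(point_feat, image_feat, points_xyz,
                          lidar2cam, cam_intrinsics, image_hw):
    """Concatenate LiDAR point features with sampled image features.

    Points outside the camera frustum keep zero image features, which is where
    a learned cross-modal feature completion step would plug in.
    """
    uv, visible = project_points_to_image(points_xyz, lidar2cam, cam_intrinsics, image_hw)
    img_per_point = point_feat.new_zeros(point_feat.shape[0], image_feat.shape[0])
    if visible.any():
        img_per_point[visible] = gather_image_features(image_feat, uv[visible], image_hw)
    return torch.cat([point_feat, img_per_point], dim=1)               # (N, C_lidar + C_img)
```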
Related papers
- FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal 3D Object Detection [10.070120335536075]
Multimodal 3D object detection has garnered considerable interest in autonomous driving.
However, multimodal detectors suffer from dimension mismatches that arise from coarsely fusing 3D points with 2D pixels.
We propose a multimodal framework FGU3R to tackle the issue via unified 3D representation and fine-grained fusion.
arXiv Detail & Related papers (2025-01-08T09:26:36Z)
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages vision foundation models (VFMs) to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
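Below is a toy PyTorch sketch of the kind of superpixel-driven contrastive alignment the LargeAD entry describes: image and point features are pooled per superpixel and matched with an InfoNCE-style loss. The names, shapes, and loss form are assumptions for illustration, not the paper's implementation.

```python
# Toy sketch (assumed names/shapes) of superpixel-to-point contrastive alignment.

import torch
import torch.nn.functional as F


def superpixel_point_contrastive_loss(pixel_feat, point_feat, pixel_sp_id, point_sp_id,
                                      num_superpixels, temperature=0.07):
    """InfoNCE-style loss between superpixel-pooled image and point features.

    pixel_feat: (P, C) pixel features; point_feat: (Q, C) features of points
    projected into the image; pixel_sp_id / point_sp_id: superpixel index of
    each pixel / point. Empty superpixels are not filtered for brevity.
    """
    c = pixel_feat.shape[1]

    def pool(feat, ids):
        # average the features assigned to each superpixel
        sums = feat.new_zeros(num_superpixels, c).index_add_(0, ids, feat)
        counts = feat.new_zeros(num_superpixels).index_add_(0, ids, feat.new_ones(ids.shape[0]))
        return sums / counts.clamp(min=1).unsqueeze(1)

    img_proto = F.normalize(pool(pixel_feat, pixel_sp_id), dim=1)    # (S, C)
    pts_proto = F.normalize(pool(point_feat, point_sp_id), dim=1)    # (S, C)

    logits = pts_proto @ img_proto.T / temperature                   # (S, S)
    targets = torch.arange(num_superpixels, device=logits.device)    # matching superpixel is the positive
    return F.cross_entropy(logits, targets)
```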
Multi-sensor fusion is essential for accurate 3D object detection in self-driving systems.
In this paper, we propose a new intermediate-level multi-modal fusion approach to overcome these challenges.
The code with the mmdetection3D project plugin will be publicly available soon.
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
Such insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
- SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z)
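As a rough illustration of the distillation idea behind the SimDistill entry, the sketch below regresses a camera-only student's BEV features toward a frozen multi-modal teacher's BEV features with a simple L2 loss; module names, channel sizes, and the loss choice are placeholders, not the paper's actual architecture or strategy.

```python
# Generic BEV feature-distillation sketch (not SimDistill's actual strategy).

import torch
import torch.nn as nn
import torch.nn.functional as F


class BEVFeatureDistillation(nn.Module):
    """L2 distillation between teacher and student BEV feature maps."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv aligns student channels with the teacher's feature dimension
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_bev, teacher_bev):
        # student_bev: (B, Cs, H, W) from the camera-only branch
        # teacher_bev: (B, Ct, H, W) from the multi-modal teacher, kept frozen
        aligned = self.adapter(student_bev)
        return F.mse_loss(aligned, teacher_bev.detach())


# Usage sketch: total loss = detection loss + lambda * distillation loss
# distill = BEVFeatureDistillation(student_channels=128, teacher_channels=256)
# loss = det_loss + 0.5 * distill(student_bev, teacher_bev)
```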
- Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection [16.198358858773258]
Multi-modal 3D object detection has been an active research topic in autonomous driving.
It is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels.
arXiv Detail & Related papers (2022-10-18T06:15:56Z)
- MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images into 3D space for fusion.
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
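The "lifting" step mentioned in the MSMDFusion entry can be illustrated by the generic unprojection below, which turns pixels with candidate depths into virtual 3D points in the LiDAR frame; this toy sketch ignores the paper's multi-depth seeds and multi-scale interaction, and all names are assumptions.

```python
# Toy sketch of lifting 2D pixels with candidate depths into LiDAR-frame 3D points.

import torch


def lift_pixels_to_lidar(uv, depths, cam_intrinsics, cam2lidar):
    """Unproject pixels (N, 2) with depths (N,) into LiDAR-frame 3D points (N, 3)."""
    fx, fy = cam_intrinsics[0, 0], cam_intrinsics[1, 1]
    cx, cy = cam_intrinsics[0, 2], cam_intrinsics[1, 2]
    # back-project through the pinhole model into camera coordinates
    x = (uv[:, 0] - cx) / fx * depths
    y = (uv[:, 1] - cy) / fy * depths
    pts_cam = torch.stack([x, y, depths], dim=1)                               # (N, 3)
    # homogeneous transform into the LiDAR frame
    homo = torch.cat([pts_cam, pts_cam.new_ones(pts_cam.shape[0], 1)], dim=1)  # (N, 4)
    return (cam2lidar @ homo.T).T[:, :3]
```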
- DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)