MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving
- URL: http://arxiv.org/abs/2303.08600v1
- Date: Wed, 15 Mar 2023 13:13:03 GMT
- Title: MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving
- Authors: Jiale Li, Hang Dai, Hao Han, Yong Ding
- Abstract summary: We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion.
Under malfunctioning camera input, MSeg3D remains robust and still improves over the LiDAR-only baseline.
- Score: 15.36416000750147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR and camera are two modalities available for 3D semantic segmentation in
autonomous driving. Popular LiDAR-only methods suffer from inferior segmentation on small
and distant objects due to insufficient laser points, while robust multi-modal solutions
remain under-explored. We investigate three crucial inherent difficulties: modality
heterogeneity, limited intersection of the sensor fields of view, and multi-modal data
augmentation. We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint
intra-modal feature extraction and inter-modal feature fusion to mitigate the modality
heterogeneity. The multi-modal fusion in MSeg3D consists of geometry-based feature fusion
(GF-Phase), cross-modal feature completion, and semantic-based feature fusion (SF-Phase) on
all visible points. Multi-modal data augmentation is reinvigorated by applying asymmetric
transformations to the LiDAR point cloud and the multi-camera images individually, which
benefits model training with diversified augmentations. MSeg3D achieves state-of-the-art
results on the nuScenes, Waymo, and SemanticKITTI datasets. Under malfunctioning
multi-camera input and multi-frame point cloud input, MSeg3D remains robust and still
improves over the LiDAR-only baseline. Our code is publicly available at
https://github.com/jialeli1/lidarseg3d.
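
The geometry-based fusion phase (GF-Phase) described above requires associating each LiDAR point that falls inside a camera frustum with the image features at its projected pixel. The following is a minimal PyTorch sketch of that point-to-pixel association, assuming a single camera with known LiDAR-to-camera extrinsics and pinhole intrinsics; the function names and the zero-filling of out-of-view points are illustrative assumptions, not the MSeg3D implementation.

```python
import torch
import torch.nn.functional as F

def project_points_to_image(points_xyz, lidar_to_cam, cam_intrinsic, image_hw):
    """Project LiDAR points into one camera; return pixel coords and a visibility mask.

    points_xyz:    (N, 3) coordinates in the LiDAR frame
    lidar_to_cam:  (4, 4) extrinsic transform from LiDAR frame to camera frame
    cam_intrinsic: (3, 3) pinhole intrinsics
    image_hw:      (H, W) image size in pixels
    """
    n = points_xyz.shape[0]
    homo = torch.cat([points_xyz, points_xyz.new_ones(n, 1)], dim=1)   # (N, 4)
    cam_pts = (lidar_to_cam @ homo.T).T[:, :3]                         # points in the camera frame
    in_front = cam_pts[:, 2] > 1e-3                                    # keep points in front of the camera
    uvw = (cam_intrinsic @ cam_pts.T).T                                # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-3)                      # perspective division
    h, w = image_hw
    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & in_image

def gather_image_features(image_feat, uv, visible, image_hw):
    """Bilinearly sample a per-point camera feature for every visible point.

    image_feat: (C, Hf, Wf) feature map from the image backbone
    uv:         (N, 2) pixel coordinates at the original image resolution
    """
    h, w = image_hw
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,
                        uv[:, 1] / (h - 1) * 2 - 1], dim=1).view(1, 1, -1, 2)
    sampled = F.grid_sample(image_feat.unsqueeze(0), grid, align_corners=True)
    point_img_feat = sampled.view(image_feat.shape[0], -1).T           # (N, C)
    # Zero out features of points outside the frustum; the cross-modal feature
    # completion described in the abstract would predict these instead.
    return point_img_feat * visible.float().unsqueeze(1)
```

In a full model, the sampled camera features would be combined with the per-point LiDAR features before the semantic-based fusion (SF-Phase); the points outside every camera frustum are exactly the ones the cross-modal feature completion is meant to handle.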
Related papers
- Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLM) into 3D backbones.
MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z)
- mmFUSION: Multimodal Fusion for 3D Objects Detection [18.401155770778757]
Multi-sensor fusion is essential for accurate 3D object detection in self-driving systems.
In this paper, we propose a new intermediate-level multi-modal fusion approach to overcome the challenges of cross-modal feature fusion.
The code with the mmdetection3D project plugin will be publicly available soon.
arXiv Detail & Related papers (2023-11-07T15:11:27Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text.
This insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space.
We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
- SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z)
- Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection [16.198358858773258]
Multi-modal 3D object detection has been an active research topic in autonomous driving.
It is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels.
arXiv Detail & Related papers (2022-10-18T06:15:56Z)
- MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images into 3D space for fusion.
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
- DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)
- Exploring Data Augmentation for Multi-Modality 3D Object Detection [82.9988604088494]
It is counter-intuitive that multi-modality methods based on point cloud and images perform only marginally better or sometimes worse than approaches that solely use point cloud.
We propose a pipeline, named transformation flow, to bridge the gap between single- and multi-modality data augmentation with transformation reversing and replaying (see the sketch after this list).
Our method also wins the best PKL award in the 3rd nuScenes detection challenge.
arXiv Detail & Related papers (2020-12-23T15:23:16Z)
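
The asymmetric multi-modal augmentation in the MSeg3D abstract and the transformation reversing and replaying in the last entry above share one piece of bookkeeping: record the geometric transform applied to the LiDAR points so it can be undone when projecting points into the separately augmented camera images. Below is a minimal NumPy sketch of that bookkeeping under a simple yaw-rotation/scale/flip augmentation; the names and the augmentation choices are illustrative assumptions rather than either paper's implementation.

```python
import numpy as np

def augment_points(points_xyz, rng):
    """Apply a random global augmentation to LiDAR points and return its inverse.

    points_xyz: (N, 3) array in the LiDAR sensor frame.
    Returns the augmented points and a 4x4 matrix mapping augmented coordinates
    back to the original sensor frame, so camera projection can be replayed
    with un-augmented geometry.
    """
    yaw = rng.uniform(-np.pi / 4, np.pi / 4)      # random rotation about the z-axis
    scale = rng.uniform(0.95, 1.05)               # random global scaling
    flip_y = rng.random() < 0.5                   # random mirror across the x-z plane

    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = scale * np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    if flip_y:
        T[:3, :3] = T[:3, :3] @ np.diag([1.0, -1.0, 1.0])

    homo = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    aug_points = (T @ homo.T).T[:, :3]
    T_inv = np.linalg.inv(T)                      # transform that reverses the augmentation
    return aug_points.astype(points_xyz.dtype), T_inv

def points_for_projection(aug_points, T_inv):
    """Undo the LiDAR augmentation before projecting into camera images, so the
    image branch can use its own, independently augmented images."""
    homo = np.concatenate([aug_points, np.ones((len(aug_points), 1))], axis=1)
    return (T_inv @ homo.T).T[:, :3]

# Usage sketch:
# rng = np.random.default_rng(0)
# aug_points, T_inv = augment_points(points_xyz, rng)     # feed aug_points to the LiDAR branch
# proj_points = points_for_projection(aug_points, T_inv)  # use proj_points for point-to-pixel projection
```

In practice the image branch would additionally record its own 2D geometric augmentations (resize, crop, flip) so that projected pixel coordinates can be mapped into the augmented images; that 2D bookkeeping is omitted here for brevity.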
This list is automatically generated from the titles and abstracts of the papers in this site.