Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic
Segmentation
- URL: http://arxiv.org/abs/2212.06682v1
- Date: Tue, 13 Dec 2022 15:58:25 GMT
- Title: Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic
Segmentation
- Authors: Chaolong Yang, Yuyao Yan, Weiguang Zhao, Jianan Ye, Xi Yang, Amir
Hussain, Kaizhu Huang
- Abstract summary: Combining 2D information to achieve better 3D semantic segmentation has become mainstream in 3D scene understanding.
It still remains elusive how to fuse and process the cross-dimensional features from these two distinct spaces.
In this paper, we argue that, despite its simplicity, unidirectionally projecting multi-view 2D deep semantic features into the 3D space and aligning them with 3D deep semantic features can lead to better feature fusion.
- Score: 17.557697146752652
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: 3D point clouds are rich in geometric structure information, while 2D images
contain important and continuous texture information. Combining 2D information
to achieve better 3D semantic segmentation has become mainstream in 3D scene
understanding. Despite this success, it remains elusive how to fuse and
process the cross-dimensional features from these two distinct spaces. Existing
state-of-the-art methods usually exploit bidirectional projection to align the
cross-dimensional features and realize both 2D & 3D semantic segmentation
tasks. However, to enable bidirectional mapping, this framework often requires
a symmetrical 2D-3D network structure, which limits the network's flexibility.
Meanwhile, such dual-task settings may easily distract the network and lead to
over-fitting on the 3D segmentation task. Constrained by this inflexibility,
the fused features can only pass through a decoder network, which hurts model
performance due to insufficient depth. To alleviate these drawbacks, we argue
in this paper that, despite its simplicity, unidirectionally projecting
multi-view 2D deep semantic features into the 3D space and aligning them with
3D deep semantic features can lead to better feature fusion. On the one hand,
the unidirectional projection forces our model to focus on the core task,
i.e., 3D segmentation; on the other hand, relaxing the bidirectional projection
to a unidirectional one enables deeper cross-domain semantic alignment and
offers the flexibility to fuse richer, more complicated features from very
different spaces. Among joint 2D-3D approaches, our proposed method achieves
superior performance on the ScanNetv2 benchmark for 3D semantic segmentation.
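To make the core idea concrete, below is a minimal NumPy sketch of unidirectional multi-view fusion: 3D points are projected into each calibrated view, the corresponding 2D deep features are sampled, averaged over the views in which a point is visible, and concatenated with the point's 3D features. The function name, tensor shapes, nearest-neighbour sampling, and simple concatenation are illustrative assumptions for a pinhole-camera setup, not the authors' actual implementation.

```python
import numpy as np

def fuse_multiview_2d_into_3d(points, feats_3d, feat_maps, intrinsics, extrinsics):
    """Unidirectionally project multi-view 2D features onto 3D points and fuse.

    points:     (N, 3) point coordinates in world space
    feats_3d:   (N, C3) deep 3D features for the same points
    feat_maps:  (V, H, W, C2) deep 2D feature maps, one per view
    intrinsics: (V, 3, 3) pinhole camera matrices
    extrinsics: (V, 4, 4) world-to-camera transforms
    Returns (N, C3 + C2) fused features (2D part averaged over visible views).
    """
    V, H, W, C2 = feat_maps.shape
    N = points.shape[0]
    acc = np.zeros((N, C2), dtype=np.float32)
    cnt = np.zeros((N, 1), dtype=np.float32)
    pts_h = np.concatenate([points, np.ones((N, 1))], axis=1)  # homogeneous coords

    for view in range(V):
        cam = (extrinsics[view] @ pts_h.T).T[:, :3]        # points in camera frame
        z = np.clip(cam[:, 2], 1e-6, None)                 # avoid division by zero
        uvw = (intrinsics[view] @ cam.T).T                 # pinhole projection
        u = np.round(uvw[:, 0] / z).astype(int)            # pixel column
        v = np.round(uvw[:, 1] / z).astype(int)            # pixel row
        valid = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        acc[valid] += feat_maps[view, v[valid], u[valid]]  # nearest-neighbour sampling
        cnt[valid] += 1.0

    feats_2d = acc / np.clip(cnt, 1.0, None)               # average over visible views
    return np.concatenate([feats_3d, feats_2d], axis=1)    # simple concat fusion
```

In practice such a projection step would sit inside a deep network so the aligned 2D and 3D features can be fused and decoded jointly; this sketch only shows the geometric alignment.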
Related papers
- DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields [68.94868475824575]
This paper introduces a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations.
We leverage the strong semantic prior within a 3D generative model to train a semantic decoder.
Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data.
arXiv Detail & Related papers (2023-11-18T21:58:28Z)
- Exploiting the Complementarity of 2D and 3D Networks to Address Domain-Shift in 3D Semantic Segmentation [14.30113021974841]
3D semantic segmentation is a critical task in many real-world applications, such as autonomous driving, robotics, and mixed reality.
A possible solution is to combine the 3D information with information from sensors of a different modality, such as RGB cameras.
Recent multi-modal 3D semantic segmentation networks exploit these modalities by relying on two branches that process the 2D and 3D information independently.
arXiv Detail & Related papers (2023-04-06T10:59:43Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art performance on semantic scene completion on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception [122.53774221136193]
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to utilize 3D voxelization and 3D convolution networks.
We propose a new framework for outdoor LiDAR segmentation, where a cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern (a minimal coordinate-transform sketch appears after this list).
arXiv Detail & Related papers (2021-09-12T06:25:11Z)
- Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z)
- Bidirectional Projection Network for Cross Dimension Scene Understanding [69.29443390126805]
We present a bidirectional projection network (BPNet) for joint 2D and 3D reasoning in an end-to-end manner.
Via the BPM, complementary 2D and 3D information can interact with each other in multiple architectural levels.
Our BPNet achieves top performance on the ScanNetV2 benchmark for both 2D and 3D semantic segmentation.
arXiv Detail & Related papers (2021-03-26T08:31:39Z)
- Learning Joint 2D-3D Representations for Depth Completion [90.62843376586216]
We design a simple yet effective neural network block that learns to extract joint 2D and 3D features.
Specifically, the block consists of two domain-specific sub-networks that apply 2D convolution on image pixels and continuous convolution on 3D points.
arXiv Detail & Related papers (2020-12-22T22:58:29Z)
- Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences [32.01548991331616]
This paper presents a novel self-supervised learning approach to learn both 2D image features and 3D point cloud features.
It exploits cross-modality and cross-view correspondences without using any annotated human labels.
The effectiveness of the learned 2D and 3D features is evaluated by transferring them to five different tasks.
arXiv Detail & Related papers (2020-04-13T02:57:25Z)
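For the "Cylindrical and Asymmetrical 3D Convolution Networks" entry above, the cylindrical partition replaces regular Cartesian voxels with a grid defined over radius, azimuth, and height. The NumPy sketch below shows only that coordinate transform and voxel assignment, not the asymmetrical 3D convolutions; the grid resolution and coordinate ranges are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def cylindrical_voxel_indices(points, grid=(480, 360, 32),
                              rho_range=(0.0, 50.0), z_range=(-4.0, 2.0)):
    """Assign Cartesian LiDAR points (N, 3) to cylindrical voxels.

    Coordinates: rho (radial distance), phi (azimuth), z (height).
    Returns (N, 3) integer voxel indices into the given grid.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)
    phi = np.arctan2(y, x)                         # azimuth in [-pi, pi]

    # Normalize each cylindrical coordinate into [0, 1) over its covered range.
    rho_n = (rho - rho_range[0]) / (rho_range[1] - rho_range[0])
    phi_n = (phi + np.pi) / (2.0 * np.pi)
    z_n = (z - z_range[0]) / (z_range[1] - z_range[0])

    norm = np.stack([rho_n, phi_n, z_n], axis=1)
    norm = np.clip(norm, 0.0, 1.0 - 1e-6)          # keep boundary points in range
    return (norm * np.asarray(grid)).astype(np.int64)
```

Compared with a Cartesian grid, this partition keeps the number of points per voxel more balanced for rotating LiDAR scans, since sparsity grows with radial distance.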
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.