IMENet: Joint 3D Semantic Scene Completion and 2D Semantic Segmentation
through Iterative Mutual Enhancement
- URL: http://arxiv.org/abs/2106.15413v1
- Date: Tue, 29 Jun 2021 13:34:20 GMT
- Title: IMENet: Joint 3D Semantic Scene Completion and 2D Semantic Segmentation
through Iterative Mutual Enhancement
- Authors: Jie Li, Laiyan Ding and Rui Huang
- Abstract summary: We propose an Iterative Mutual Enhancement Network (IMENet) to solve 3D semantic scene completion and 2D semantic segmentation.
IMENet interactively refines the two tasks at the late prediction stage.
Our approach outperforms the state of the art on both 3D semantic scene completion and 2D semantic segmentation.
- Score: 12.091735711364239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D semantic scene completion and 2D semantic segmentation are two tightly
correlated tasks that are both essential for indoor scene understanding,
because they predict the same semantic classes, using positively correlated
high-level features. Current methods use 2D features extracted from early-fused
RGB-D images for 2D segmentation to improve 3D scene completion. We argue that
this sequential scheme does not ensure these two tasks fully benefit each
other, and present an Iterative Mutual Enhancement Network (IMENet) to solve
them jointly, which interactively refines the two tasks at the late prediction
stage. Specifically, two refinement modules are developed under a unified
framework for the two tasks. The first is a 2D Deformable Context Pyramid (DCP)
module, which receives the projection from the current 3D predictions to refine
the 2D predictions. In turn, a 3D Deformable Depth Attention (DDA) module is
proposed to leverage the reprojected results from 2D predictions to update the
coarse 3D predictions. This iterative fusion operates on the stable high-level
features of both tasks at a late stage. Extensive experiments on the NYU and NYUCAD
datasets verify the effectiveness of the proposed iterative late fusion scheme,
and our approach outperforms the state of the art on both 3D semantic scene
completion and 2D semantic segmentation.
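The iterative late-fusion loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the depth-indexed projection and the fixed blend weight `alpha` are toy stand-ins for the learned DCP and DDA modules, and all array sizes are assumptions.

```python
import numpy as np

H, W, D, C = 4, 4, 3, 2                  # toy image size, depth bins, classes
rng = np.random.default_rng(0)
pred_2d = rng.normal(size=(H, W, C))     # coarse 2D segmentation logits
pred_3d = rng.normal(size=(H, W, D, C))  # coarse 3D completion logits (camera-aligned grid)
depth = rng.integers(0, D, size=(H, W))  # depth bin of the visible surface per pixel

def project_3d_to_2d(p3d, depth):
    # Read the 3D logits at each pixel's visible surface voxel.
    return np.take_along_axis(p3d, depth[..., None, None], axis=2).squeeze(2)

def reproject_2d_to_3d(p2d, depth, D):
    # Write each pixel's 2D logits into its surface voxel (zeros elsewhere).
    out = np.zeros(p2d.shape[:2] + (D, p2d.shape[-1]))
    np.put_along_axis(out, depth[..., None, None], p2d[..., None, :], axis=2)
    return out

alpha = 0.5                  # blend weight, a stand-in for the learned refinement
for _ in range(2):           # a few mutual-refinement iterations at the late stage
    # DCP-like step: 3D predictions refine the 2D predictions.
    pred_2d = (1 - alpha) * pred_2d + alpha * project_3d_to_2d(pred_3d, depth)
    # DDA-like step: refined 2D predictions update the coarse 3D predictions.
    pred_3d = (1 - alpha) * pred_3d + alpha * reproject_2d_to_3d(pred_2d, depth, D)
```

The key property this sketch shares with the paper's scheme is that each branch's update consumes the other branch's latest prediction, rather than fusing once in a fixed sequence.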
Related papers
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized
Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of 2D features projected along camera rays into 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance of the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z) - Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud
Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z) - SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art semantic scene completion performance on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z) - Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic
Segmentation [17.557697146752652]
2D and 3D semantic segmentation have become mainstream in 3D scene understanding.
How to fuse and process the cross-dimensional features from these two distinct spaces, however, remains elusive.
In this paper, we argue that, despite its simplicity, unidirectionally projecting multi-view 2D deep semantic features into the 3D space, aligned with 3D deep semantic features, can lead to better feature fusion.
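A minimal sketch of such unidirectional 2D-to-3D lifting, assuming a single pinhole view with intrinsics `K` and a per-pixel depth map (the function name and single-view simplification are illustrative; the paper fuses features from multiple views):

```python
import numpy as np

def lift_2d_features_to_3d(feat_2d, depth, K):
    """Back-project per-pixel 2D features to 3D points using depth and pinhole
    intrinsics K; a 3D network can then fuse them with 3D features at these points."""
    h, w, c = feat_2d.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))        # pixel coordinates
    x = (u - K[0, 2]) * depth / K[0, 0]                   # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * depth / K[1, 1]                   # Y = (v - cy) * Z / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3) # one 3D point per pixel
    return pts, feat_2d.reshape(-1, c)                    # point + its lifted feature
```

For example, the pixel at the principal point (cx, cy) lifts to a point on the optical axis, (0, 0, Z).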
arXiv Detail & Related papers (2022-12-13T15:58:25Z) - Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection [11.575945934519442]
LiDAR and camera fusion techniques are promising for achieving 3D object detection in autonomous driving.
Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds.
We propose a general multi-modal fusion framework, Multi-Sem Fusion (MSF), to fuse the semantic information from both 2D image and 3D point cloud scene parsing results.
arXiv Detail & Related papers (2022-12-10T10:54:41Z) - Semantic Dense Reconstruction with Consistent Scene Segments [33.0310121044956]
A method for dense semantic 3D scene reconstruction from an RGB-D sequence is proposed to solve high-level scene understanding tasks.
First, each RGB-D pair is consistently segmented into 2D semantic maps based on a camera tracking backbone.
A dense 3D mesh model of an unknown environment is incrementally generated from the input RGB-D sequence.
arXiv Detail & Related papers (2021-09-30T03:01:17Z) - Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based
Perception [122.53774221136193]
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to utilize the 3D voxelization and 3D convolution network.
We propose a new framework for outdoor LiDAR segmentation, in which cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern.
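The cylindrical partition can be sketched as a mapping from Cartesian LiDAR points to (radius, angle, height) voxel indices; the bin counts and ranges below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def cylindrical_voxel_index(points, rho_bins=480, phi_bins=360, z_bins=32,
                            rho_max=50.0, z_min=-4.0, z_max=2.0):
    """Map Cartesian points (N, 3) to cylindrical voxel indices (N, 3).
    Unlike a uniform Cartesian grid, angular bins widen with distance,
    matching the sparsity pattern of rotating-LiDAR scans."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x**2 + y**2)          # radial distance from the sensor
    phi = np.arctan2(y, x)              # azimuth in [-pi, pi)
    i = np.clip((rho / rho_max * rho_bins).astype(int), 0, rho_bins - 1)
    j = np.clip(((phi + np.pi) / (2 * np.pi) * phi_bins).astype(int), 0, phi_bins - 1)
    k = np.clip(((z - z_min) / (z_max - z_min) * z_bins).astype(int), 0, z_bins - 1)
    return np.stack([i, j, k], axis=1)
```

Points falling outside the assumed ranges are clipped to the boundary voxels.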
arXiv Detail & Related papers (2021-09-12T06:25:11Z) - Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z) - 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
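The first two steps above (feature distillation plus dimension normalization) can be sketched as an L2 loss between normalized feature maps; the simple per-channel normalization here is an illustrative stand-in for the paper's two-stage scheme:

```python
import numpy as np

def distill_3d_to_2d_loss(feat_2d, feat_3d_proj, eps=1e-6):
    """L2 distillation loss pushing 2D features (H, W, C) toward projected 3D
    features of the same shape, after per-channel standardization so the two
    feature spaces are comparable before the distance is taken."""
    def standardize(f):
        mu = f.mean(axis=(0, 1), keepdims=True)
        sd = f.std(axis=(0, 1), keepdims=True)
        return (f - mu) / (sd + eps)
    return float(np.mean((standardize(feat_2d) - standardize(feat_3d_proj)) ** 2))
```

Because of the standardization, the loss is near zero whenever the two feature maps differ only by a per-channel affine rescaling, which is the calibration role the normalization plays.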
arXiv Detail & Related papers (2021-04-06T02:22:24Z) - Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic
Segmentation [87.54570024320354]
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space.
A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space.
We develop a framework based on 3D cylinder partition and 3D cylinder convolution, termed Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds.
arXiv Detail & Related papers (2020-08-04T13:56:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.