NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized
Device Coordinates Space
- URL: http://arxiv.org/abs/2309.14616v3
- Date: Wed, 11 Oct 2023 22:15:54 GMT
- Title: NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized
Device Coordinates Space
- Authors: Jiawei Yao, Chuming Li, Keqiang Sun, Yingjie Cai, Hao Li, Wanli Ouyang
and Hongsheng Li
- Abstract summary: Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometric shapes from a single image, requiring no 3D input.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of 2D features projected along a camera ray into the 3D space, the Pose Ambiguity of the 3D convolution, and the Computation Imbalance of the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to a Normalized Device Coordinates (NDC) space rather than to the world space.
- Score: 77.6067460464962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular 3D Semantic Scene Completion (SSC) has garnered significant
attention in recent years due to its potential to predict complex semantics and
geometric shapes from a single image, requiring no 3D input. In this paper, we
identify several critical issues in current state-of-the-art methods, including
the Feature Ambiguity of 2D features projected along a camera ray into the 3D space, the
Pose Ambiguity of the 3D convolution, and the Computation Imbalance in the 3D
convolution across different depth levels. To address these problems, we devise
a novel Normalized Device Coordinates scene completion network (NDC-Scene) that
directly extends the 2D feature map to a Normalized Device Coordinates (NDC)
space, rather than to the world space, through progressive restoration
of the depth dimension with deconvolution operations. Experimental results
demonstrate that transferring the majority of computation from the target 3D
space to the proposed normalized device coordinates space benefits monocular
SSC tasks. Additionally, we design a Depth-Adaptive Dual Decoder to
simultaneously upsample and fuse the 2D and 3D feature maps, further improving
overall performance. Our extensive experiments confirm that the proposed method
consistently outperforms state-of-the-art methods on both outdoor SemanticKITTI
and indoor NYUv2 datasets. Our code is available at
https://github.com/Jiawei-Yao0812/NDCScene.
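As a rough illustration of the paper's core idea, the following PyTorch sketch lifts a 2D feature map into an image-aligned (NDC-like) volume by progressively restoring a depth axis with 3D deconvolutions. The module name, channel counts, and block layout are illustrative assumptions, not the authors' exact architecture; see the linked repository for the reference implementation.

```python
import torch
import torch.nn as nn

class DepthRestoration(nn.Module):
    """Progressively restores a depth axis on a 2D feature map with 3D
    deconvolutions, so most computation stays in an image-aligned
    (NDC-like) frustum rather than in world-space voxels.
    Hypothetical sketch; layer sizes are illustrative, not the paper's."""
    def __init__(self, channels: int, num_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                # Doubles the depth dimension, keeps H and W fixed.
                nn.ConvTranspose3d(channels, channels,
                                   kernel_size=(2, 3, 3),
                                   stride=(2, 1, 1),
                                   padding=(0, 1, 1)),
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_blocks)
        ])

    def forward(self, feat2d: torch.Tensor) -> torch.Tensor:
        # feat2d: (B, C, H, W) image features from the 2D backbone.
        x = feat2d.unsqueeze(2)           # (B, C, D=1, H, W)
        for block in self.blocks:
            x = block(x)                  # depth doubles at every block
        return x                          # (B, C, 2**num_blocks, H, W)

# Usage: a 64-channel feature map gains 16 depth bins in NDC space.
feats = torch.randn(1, 64, 24, 80)
ndc_volume = DepthRestoration(64)(feats)  # -> (1, 64, 16, 24, 80)
```

Keeping the volume in NDC rather than world coordinates keeps every depth slice pixel-aligned, which is what lets the heavy 3D computation be spread evenly across depth levels.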
Related papers
- Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation [32.50849425431012]
For autonomous cars equipped with multiple cameras and LiDAR, it is critical to aggregate multi-sensor information into a unified 3D space for accurate and robust predictions.
Recent methods are mainly built on the 2D-to-3D transformation that relies on sensor calibration to project the 2D image information into the 3D space.
In this work, we propose a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence.
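A minimal sketch of this calibration-free idea, assuming learnable voxel queries and standard multi-head attention in PyTorch (the class name and shapes are hypothetical):

```python
import torch
import torch.nn as nn

class CalibFreeLift(nn.Module):
    """Hypothetical sketch: learnable 3D voxel queries cross-attend to
    flattened image tokens, so 2D-to-3D correspondence is learned
    implicitly instead of being given by sensor calibration."""
    def __init__(self, dim: int, num_voxels: int, num_heads: int = 4):
        super().__init__()
        self.voxel_queries = nn.Parameter(torch.randn(num_voxels, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, H*W, C) flattened (multi-camera) image features.
        q = self.voxel_queries.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        out, _ = self.attn(q, img_tokens, img_tokens)
        return out                        # (B, num_voxels, C) 3D features

# Usage: lift 24x80 image tokens into a 16^3 voxel grid of features.
voxels = CalibFreeLift(128, num_voxels=16 ** 3)(torch.randn(2, 24 * 80, 128))
```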
arXiv Detail & Related papers (2024-11-19T02:40:42Z)
- Tri-Perspective View Decomposition for Geometry-Aware Depth Completion [24.98850285904668]
Tri-Perspective View Decomposition (TPVD) is a novel framework that explicitly models 3D geometry.
TPVD decomposes the original point cloud into three 2D views.
TPVD outperforms existing methods on KITTI, NYUv2, and SUN RGBD.
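The decomposition step can be sketched as scattering a normalized point cloud onto three axis-aligned occupancy maps; the grid size and [0, 1) normalization below are illustrative assumptions, not TPVD's exact recipe:

```python
import torch

def tri_perspective_decompose(points: torch.Tensor, grid: int = 64):
    """Scatter a point cloud (N, 3), normalized to [0, 1), onto three
    axis-aligned 2D occupancy maps: top-down (x, y), front (x, z),
    and side (y, z). Hypothetical sketch of the tri-view idea."""
    idx = (points.clamp(0, 0.999) * grid).long()   # (N, 3) cell indices
    views = torch.zeros(3, grid, grid)
    views[0, idx[:, 0], idx[:, 1]] = 1.0           # top view
    views[1, idx[:, 0], idx[:, 2]] = 1.0           # front view
    views[2, idx[:, 1], idx[:, 2]] = 1.0           # side view
    return views

# Usage: 1000 random points -> three 64x64 binary maps.
views = tri_perspective_decompose(torch.rand(1000, 3))
```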
arXiv Detail & Related papers (2024-03-22T07:45:50Z)
- Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse-to-fine manner, featurizes them with relative-position attention, and selects where to focus the next round of point sampling.
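A toy version of that coarse-to-fine loop is sketched below; the scoring callback stands in for Act3D's relative-position attention and is purely a placeholder:

```python
import torch

def coarse_to_fine_sample(score_fn, res: int = 8, steps: int = 3):
    """Hypothetical sketch of coarse-to-fine 3D sampling: score a grid
    of points, recenter on the best cell, shrink the window, repeat.
    score_fn maps (N, 3) points to (N,) relevance scores."""
    center, half = torch.zeros(3), 1.0             # search [-1, 1]^3
    for _ in range(steps):
        axes = [torch.linspace(-half, half, res) + c for c in center]
        grid = torch.cartesian_prod(*axes)         # (res**3, 3) points
        center = grid[score_fn(grid).argmax()]     # focus on best cell
        half /= res                                # refine the window
    return center

# Usage: a toy score peaking at a hidden target point.
target = torch.tensor([0.3, -0.2, 0.5])
found = coarse_to_fine_sample(lambda p: -(p - target).norm(dim=1))
```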
arXiv Detail & Related papers (2023-06-30T17:34:06Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
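The masking step shared by MAE-style methods can be sketched as follows; the token layout and mask ratio are generic assumptions rather than Joint-MAE's exact recipe:

```python
import torch

def random_mask(tokens: torch.Tensor, ratio: float = 0.6):
    """Hypothetical sketch of MAE-style masking: keep a random subset
    of tokens (2D patch or 3D point-group embeddings) and return the
    permutation needed to restore order for the decoder."""
    B, N, C = tokens.shape
    keep = int(N * (1 - ratio))
    ids = torch.rand(B, N).argsort(dim=1)          # random permutation
    kept = torch.gather(tokens, 1,
                        ids[:, :keep, None].expand(B, keep, C))
    return kept, ids                               # visible tokens, order

# Usage: keep 40% of 196 patch tokens for the encoder.
visible, order = random_mask(torch.randn(4, 196, 256))
```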
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception [122.53774221136193]
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to utilize 3D voxelization and 3D convolution networks.
We propose a new framework for outdoor LiDAR segmentation, in which cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern.
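The cylindrical partition itself is simple to sketch: map Cartesian points to (rho, phi, z) voxel indices, so that cells cover more area at long range where LiDAR points are sparse. The bin counts and ranges below are illustrative assumptions:

```python
import math
import torch

def cylindrical_voxelize(points: torch.Tensor,
                         num_rho: int = 32, num_phi: int = 64,
                         num_z: int = 16, max_rho: float = 50.0,
                         z_min: float = -3.0, z_max: float = 3.0):
    """Map Cartesian LiDAR points (N, 3) to cylindrical voxel indices
    (rho, phi, z). Hypothetical sketch of the cylindrical partition."""
    x, y, z = points.unbind(dim=1)
    rho = torch.sqrt(x ** 2 + y ** 2).clamp(max=max_rho - 1e-4)
    phi = torch.atan2(y, x)                        # angle in [-pi, pi]
    r_idx = (rho / max_rho * num_rho).long()
    p_idx = ((phi + math.pi) / (2 * math.pi) * num_phi).long()
    p_idx = p_idx.clamp(max=num_phi - 1)
    z_norm = (z.clamp(z_min, z_max - 1e-4) - z_min) / (z_max - z_min)
    return torch.stack([r_idx, p_idx, (z_norm * num_z).long()], dim=1)

# Usage: voxelize 1000 random points in a 100m x 100m x 6m volume.
cells = cylindrical_voxelize(
    torch.rand(1000, 3) * torch.tensor([100., 100., 6.])
    - torch.tensor([50., 50., 3.]))
```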
arXiv Detail & Related papers (2021-09-12T06:25:11Z)
- FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to adapt a general 2D detector to this 3D task.
In this technical report, we study this problem with a practice built on a fully convolutional single-stage detector.
Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
arXiv Detail & Related papers (2021-04-22T09:35:35Z)
- Exploring Deep 3D Spatial Encodings for Large-Scale 3D Scene Understanding [19.134536179555102]
We propose an alternative approach that overcomes the limitations of CNN-based approaches by encoding the spatial features of raw 3D point clouds into undirected graph models.
The proposed method achieves accuracy on par with the state of the art, with improved training time and model stability, indicating strong potential for further research.
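A minimal sketch of such a graph encoding, assuming a k-nearest-neighbour construction (the paper's exact graph model may differ):

```python
import torch

def knn_graph(points: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Connect each 3D point to its k nearest neighbours and symmetrize
    the edges, yielding an undirected graph over the raw point cloud.
    Hypothetical sketch; O(N^2) distances, fine for small clouds."""
    dists = torch.cdist(points, points)            # (N, N) pairwise
    nn_idx = dists.topk(k + 1, largest=False).indices[:, 1:]  # drop self
    src = torch.arange(points.size(0)).repeat_interleave(k)
    edges = torch.stack([src, nn_idx.reshape(-1)], dim=0)
    # Undirected: add reversed edges, then deduplicate columns.
    return torch.cat([edges, edges.flip(0)], dim=1).unique(dim=1)

# Usage: edge index of shape (2, E) for a 128-point cloud.
edges = knn_graph(torch.rand(128, 3))
```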
arXiv Detail & Related papers (2020-11-29T12:56:19Z)
- Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic Segmentation [87.54570024320354]
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space.
A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space.
We develop a framework based on 3D cylinder partition and 3D cylinder convolution, termed Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds.
arXiv Detail & Related papers (2020-08-04T13:56:19Z)
- Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work on 3D object reconstruction on ShapeNet and obtain significantly more accurate 3D human reconstructions.
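The core implicit-function query can be sketched as trilinear sampling of a 3D feature grid followed by a small occupancy decoder; the shapes and MLP below are illustrative assumptions, not IF-Nets' exact layers:

```python
import torch
import torch.nn.functional as F

def query_occupancy(feat_grid: torch.Tensor, points: torch.Tensor, mlp):
    """Sample a feature grid (1, C, D, H, W) at continuous points
    (N, 3) in [-1, 1] via trilinear interpolation, then decode each
    point feature to occupancy. Hypothetical sketch of an IF-Net query."""
    g = points.view(1, 1, 1, -1, 3)                # grid_sample layout
    feats = F.grid_sample(feat_grid, g, align_corners=True)
    feats = feats.view(feat_grid.shape[1], -1).t() # (N, C) per point
    return torch.sigmoid(mlp(feats))               # (N, 1) occupancy

# Usage: query 100 random points against an 8^3 grid of 32-dim features.
mlp = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))
occ = query_occupancy(torch.randn(1, 32, 8, 8, 8),
                      torch.rand(100, 3) * 2 - 1, mlp)
```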
arXiv Detail & Related papers (2020-03-03T11:14:29Z)