Monocular Scene Reconstruction with 3D SDF Transformers
- URL: http://arxiv.org/abs/2301.13510v1
- Date: Tue, 31 Jan 2023 09:54:20 GMT
- Title: Monocular Scene Reconstruction with 3D SDF Transformers
- Authors: Weihao Yuan, Xiaodong Gu, Heng Li, Zilong Dong, Siyu Zhu
- Abstract summary: We propose an SDF transformer network, which replaces the role of the 3D CNN for better 3D feature aggregation.
Experiments on multiple datasets show that this 3D transformer network generates a more accurate and complete reconstruction.
- Score: 17.565474518578178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular scene reconstruction from posed images is challenging due to the
complexity of a large environment. Recent volumetric methods learn to directly
predict the TSDF volume and have demonstrated promising results in this task.
However, most methods focus on how to extract and fuse the 2D features into a
3D feature volume, but none of them improve the way the 3D volume is
aggregated. In this work, we propose an SDF transformer network, which replaces
the role of the 3D CNN for better 3D feature aggregation. To reduce the
explosive computational complexity of the 3D multi-head attention, we propose a
sparse window attention module, where the attention is only calculated between
the non-empty voxels within a local window. Then a top-down-bottom-up 3D
attention network is built for 3D feature aggregation, where a dilate-attention
structure is proposed to prevent geometry degeneration, and two global modules
are employed to provide global receptive fields. The experiments on multiple
datasets show that this 3D transformer network generates a more accurate and
complete reconstruction, which outperforms previous methods by a large margin.
Remarkably, the mesh accuracy is improved by 41.8%, and the mesh completeness
is improved by 25.3% on the ScanNet dataset. Project page:
https://weihaosky.github.io/sdfformer.
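To make the sparse window attention idea above concrete, here is a minimal NumPy sketch (not the authors' implementation; the window size, single attention head, and random projections are illustrative assumptions): non-empty voxels are bucketed into local windows by integer division of their coordinates, and attention is computed only among voxels that share a window, so empty space contributes no computation.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def sparse_window_attention(coords, feats, window=4, seed=0):
        """coords: (N, 3) integer coordinates of non-empty voxels.
        feats: (N, C) voxel features. Attention is restricted to voxels
        falling in the same (window x window x window) cell."""
        n, c = feats.shape
        rng = np.random.default_rng(seed)
        # Single-head Q/K/V projections (learned in practice, random here).
        wq, wk, wv = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
        q, k, v = feats @ wq, feats @ wk, feats @ wv
        out = np.zeros_like(feats)
        # Bucket occupied voxels by window index; empty voxels never appear.
        win_id = coords // window
        _, inverse = np.unique(win_id, axis=0, return_inverse=True)
        inverse = inverse.ravel()
        for w in np.unique(inverse):
            idx = np.nonzero(inverse == w)[0]
            attn = softmax(q[idx] @ k[idx].T / np.sqrt(c))
            out[idx] = attn @ v[idx]
        return out

    # Example: 100 occupied voxels in a 32^3 volume with 16-dim features.
    rng = np.random.default_rng(1)
    coords = rng.integers(0, 32, size=(100, 3))
    feats = rng.standard_normal((100, 16))
    print(sparse_window_attention(coords, feats).shape)  # (100, 16)

The cost scales with the number of occupied voxels per window rather than with the full dense volume, which is what keeps the 3D attention tractable.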
Related papers
- SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds [44.635939022626744]
3D object detection in point clouds is a core component for modern robotics and autonomous driving systems.
A key challenge in 3D object detection comes from the inherent sparsity of point occupancy within the 3D scene.
We propose Sparse Window Transformer (SWFormer), a scalable and accurate model for 3D object detection.
arXiv Detail & Related papers (2022-10-13T21:37:53Z)
- CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates high-quality 3D proposals by leveraging a class-aware local grouping strategy on the object surface voxels.
To recover the features of missed voxels due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z)
- Spatial Pruned Sparse Convolution for Efficient 3D Object Detection [41.62839541489369]
3D scenes are dominated by a large number of background points, which are redundant for a detection task that mainly needs to focus on foreground objects.
In this paper, we analyze the major components of existing 3D CNNs and find that they ignore this redundancy and further amplify it in the down-sampling process, which introduces a large amount of unnecessary computational overhead.
We propose a new convolution operator named spatial pruned sparse convolution (SPS-Conv), which includes two variants: spatial pruned submanifold sparse convolution (SPSS-Conv) and spatial pruned regular sparse convolution (SPRS-Conv).
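As a rough illustration of the shared idea behind both variants (magnitude-based spatial pruning; the importance measure and keep ratio below are assumptions, not the paper's exact rule), redundant active voxels can be dropped before the expensive convolution:

    import numpy as np

    def prune_active_voxels(coords, feats, keep_ratio=0.5):
        """coords: (N, 3) active-voxel coordinates; feats: (N, C) features.
        Keep the keep_ratio fraction of voxels with the largest feature
        magnitude; the rest are treated as redundant background."""
        mag = np.abs(feats).mean(axis=1)              # importance proxy
        k = max(1, int(len(mag) * keep_ratio))
        keep = np.argsort(mag)[-k:]                   # top-k by magnitude
        return coords[keep], feats[keep]

    rng = np.random.default_rng(0)
    coords = rng.integers(0, 64, size=(1000, 3))
    feats = rng.standard_normal((1000, 8))
    c2, f2 = prune_active_voxels(coords, feats, keep_ratio=0.3)
    print(c2.shape, f2.shape)  # (300, 3) (300, 8)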
arXiv Detail & Related papers (2022-09-28T16:19:06Z)
- Asymmetric 3D Context Fusion for Universal Lesion Detection [55.61873234187917]
3D networks are strong in modeling 3D context, yet lack supervised pretraining.
Existing 3D context fusion operators are designed to be spatially symmetric, performing identical operations on each 2D slice, as convolutions do.
We propose a novel asymmetric 3D context fusion operator (A3D), which uses different weights to fuse 3D context from different 2D slices.
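A small sketch of what "asymmetric" means here (illustrative only, not the A3D operator itself): instead of one kernel applied identically to every slice, each 2D slice position along the depth axis receives its own fusion weight.

    import numpy as np

    def asymmetric_slice_fusion(volume, slice_weights):
        """volume: (D, H, W, C) features from D 2D slices.
        slice_weights: (D,) per-slice weights (learned in practice, fixed
        here). A symmetric operator would weight every slice identically;
        distinct weights make the fusion asymmetric along depth."""
        w = slice_weights / slice_weights.sum()
        return np.tensordot(w, volume, axes=(0, 0))  # (H, W, C)

    vol = np.random.default_rng(0).standard_normal((8, 32, 32, 16))
    w = np.linspace(0.5, 1.5, 8)  # a distinct weight per slice
    print(asymmetric_slice_fusion(vol, w).shape)  # (32, 32, 16)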
arXiv Detail & Related papers (2021-09-17T16:25:10Z)
- Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception [122.53774221136193]
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to utilize the 3D voxelization and 3D convolution network.
We propose a new framework for outdoor LiDAR segmentation, where cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern.
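The cylindrical-partition step can be sketched in a few lines (bin counts and ranges below are arbitrary demo choices): each point is mapped from Cartesian (x, y, z) to (radius, azimuth, height) and voxelized on that grid, so cells naturally grow with distance, where LiDAR points are sparse.

    import numpy as np

    def cylindrical_voxel_index(points, n_r=32, n_phi=64, n_z=16,
                                r_max=50.0, z_min=-3.0, z_max=1.0):
        """points: (N, 3) Cartesian LiDAR points -> (N, 3) integer
        (radius, azimuth, height) voxel indices on a cylindrical grid."""
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        r = np.sqrt(x**2 + y**2)
        phi = np.arctan2(y, x)  # azimuth in [-pi, pi]
        ir = np.clip((r / r_max * n_r).astype(int), 0, n_r - 1)
        iphi = np.clip(((phi + np.pi) / (2 * np.pi) * n_phi).astype(int),
                       0, n_phi - 1)
        iz = np.clip(((z - z_min) / (z_max - z_min) * n_z).astype(int),
                     0, n_z - 1)
        return np.stack([ir, iphi, iz], axis=1)

    pts = np.random.default_rng(0).uniform(-40, 40, size=(1000, 3))
    print(cylindrical_voxel_index(pts).shape)  # (1000, 3)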
arXiv Detail & Related papers (2021-09-12T06:25:11Z)
- Anchor-free 3D Single Stage Detector with Mask-Guided Attention for Point Cloud [79.39041453836793]
We develop a novel single-stage 3D detector for point clouds in an anchor-free manner.
To this end, we convert the voxel-based sparse 3D feature volumes into sparse 2D feature maps.
We propose an IoU-based detection confidence re-calibration scheme to improve the correlation between the detection confidence score and the accuracy of the bounding box regression.
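A one-function sketch of IoU-based score re-calibration; the geometric blend below is a common formulation in the detection literature and is assumed here rather than taken from this paper:

    import numpy as np

    def recalibrate_scores(cls_conf, pred_iou, beta=0.5):
        """cls_conf, pred_iou: (N,) values in [0, 1]. beta controls how
        much the predicted localization quality (IoU) influences the
        final detection score."""
        return cls_conf ** (1.0 - beta) * pred_iou ** beta

    conf = np.array([0.9, 0.9])
    iou = np.array([0.3, 0.8])   # same confidence, different localization
    print(recalibrate_scores(conf, iou))  # the better-localized box ranks higher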
arXiv Detail & Related papers (2021-08-08T13:42:13Z) - Multi-Modality Task Cascade for 3D Object Detection [22.131228757850373]
Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
arXiv Detail & Related papers (2021-07-08T17:55:01Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in the 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
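A tiny greedy stand-in for the step-wise refinement loop (the actual method learns the action policy with reinforcement learning from image evidence; the hand-coded oracle and parameter layout below are illustrative):

    import numpy as np

    def refine_box(box, target, step=0.1, n_steps=100):
        """box, target: (7,) arrays (x, y, z, w, h, l, yaw). Each
        iteration adjusts the single parameter that most reduces the
        error; an RL policy would choose this action from the image."""
        box = box.copy()
        for _ in range(n_steps):
            err = target - box
            i = int(np.argmax(np.abs(err)))         # axis to act on
            box[i] += np.clip(err[i], -step, step)  # one parameter per step
        return box

    init = np.zeros(7)
    gt = np.array([0.5, -0.3, 0.2, 1.6, 1.5, 3.9, 0.1])
    print(np.round(refine_box(init, gt), 2))  # converges toward gt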
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
- Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion [53.885984328273686]
Implicit Feature Networks (IF-Nets) deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data.
IF-Nets clearly outperform prior work in 3D object reconstruction on ShapeNet and obtain significantly more accurate 3D human reconstructions.
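A compact sketch of the implicit-function idea (a generic occupancy decoder with random weights, not IF-Nets' actual multi-scale architecture): features are sampled from a 3D grid at arbitrary continuous query points and decoded per point, so the reconstruction is not tied to a fixed output resolution.

    import numpy as np

    def query_occupancy(feature_grid, points, w1, b1, w2, b2):
        """feature_grid: (R, R, R, C); points: (N, 3) in [0, 1]^3.
        Nearest-neighbor feature sampling (trilinear in practice)
        followed by a tiny MLP gives an occupancy probability per point."""
        R = feature_grid.shape[0]
        idx = np.clip((points * R).astype(int), 0, R - 1)
        f = feature_grid[idx[:, 0], idx[:, 1], idx[:, 2]]  # (N, C)
        h = np.maximum(f @ w1 + b1, 0.0)                   # ReLU layer
        return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))        # sigmoid

    rng = np.random.default_rng(0)
    grid = rng.standard_normal((16, 16, 16, 8))
    pts = rng.uniform(0, 1, size=(5, 3))
    w1, b1 = rng.standard_normal((8, 32)), np.zeros(32)
    w2, b2 = rng.standard_normal((32,)), 0.0
    print(query_occupancy(grid, pts, w1, b1, w2, b2))  # 5 probabilities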
arXiv Detail & Related papers (2020-03-03T11:14:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.