RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate
Multi-View Stereo
- URL: http://arxiv.org/abs/2307.10233v1
- Date: Sun, 16 Jul 2023 02:10:47 GMT
- Title: RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate
Multi-View Stereo
- Authors: Yifei Shi, Junhua Xi, Dewen Hu, Zhiping Cai, Kai Xu
- Abstract summary: RayMVSNet learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth.
RayMVSNet++ achieves state-of-the-art performance on the ScanNet dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning-based multi-view stereo (MVS) has thus far centered on 3D
convolution over cost volumes. Due to the high computation and memory consumption
of 3D CNNs, the resolution of the output depth is often considerably limited.
Different from most existing works dedicated to adaptive refinement of cost
volumes, we opt to directly optimize the depth value along each camera ray,
mimicking the range finding of a laser scanner. This reduces the MVS problem to
ray-based depth optimization, which is much more lightweight than full cost
volume optimization. In particular, we propose RayMVSNet which learns
sequential prediction of a 1D implicit field along each camera ray with the
zero-crossing point indicating scene depth. This sequential modeling, conducted
based on transformer features, essentially learns the epipolar line search in
traditional multi-view stereo. We devise a multi-task learning scheme for better
optimization convergence and depth accuracy. We find that the monotonicity of
the SDF along each ray greatly benefits depth estimation. Our method
ranks first on both the DTU and the Tanks & Temples datasets among all previous
learning-based methods, achieving an overall reconstruction score of 0.33mm on
DTU and an F-score of 59.48% on Tanks & Temples. It is able to produce
high-quality depth estimation and point cloud reconstruction in challenging
scenarios such as objects/scenes with non-textured surface, severe occlusion,
and highly varying depth range. Further, we propose RayMVSNet++ to enhance
contextual feature aggregation for each ray by designing an attentional
gating unit to select semantically relevant neighboring rays within the local
frustum around that ray. RayMVSNet++ achieves state-of-the-art performance on
the ScanNet dataset. In particular, it attains an AbsRel of 0.058m and produces
accurate results on the two subsets of textureless regions and large depth
variation.
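As a concrete illustration of the zero-crossing idea described above (not the authors' code; a minimal NumPy sketch under the paper's monotonicity assumption, with hypothetical function and variable names), scene depth can be recovered from sampled 1D implicit field values along a ray by locating the sign change and linearly interpolating between the two bracketing samples:

```python
import numpy as np

def zero_crossing_depth(depths, sdf_values):
    """Recover scene depth as the zero-crossing of a 1D implicit field
    sampled along a camera ray.

    depths: (N,) sample depths along the ray, strictly increasing.
    sdf_values: (N,) predicted signed distances at those samples;
    positive in front of the surface, non-positive at or behind it,
    and (per the paper's monotonicity property) decreasing along the ray.
    Returns the interpolated depth of the sign change, or None if the
    field never crosses zero within the sampled range.
    """
    signs = np.sign(sdf_values)
    # Indices i where the field goes from positive to non-positive.
    crossings = np.where((signs[:-1] > 0) & (signs[1:] <= 0))[0]
    if len(crossings) == 0:
        return None
    i = crossings[0]
    s0, s1 = sdf_values[i], sdf_values[i + 1]
    # Linear interpolation between the two bracketing samples:
    # the zero lies a fraction t = s0 / (s0 - s1) of the way along.
    t = s0 / (s0 - s1)
    return depths[i] + t * (depths[i + 1] - depths[i])
```

For example, with samples at depths [1.0, 1.25, 1.5, 1.75, 2.0] and field values [0.5, 0.25, -0.25, -0.5, -0.75], the zero-crossing falls halfway between 1.25 and 1.5, giving a depth of 1.375.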
Related papers
- Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation? [61.234412062595155]
We present ANYU, a new virtually augmented version of the NYU depth v2 dataset, designed for monocular depth estimation.
In contrast to the well-known approach where full 3D scenes of a virtual world are utilized to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects.
We show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures.
arXiv Detail & Related papers (2024-04-15T05:44:03Z) - SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model [72.0795843450604]
Current approaches face challenges in maintaining consistent accuracy across diverse scenes.
These methods rely on extensive datasets comprising millions, if not tens of millions, of data for training.
This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes.
arXiv Detail & Related papers (2024-03-13T14:08:25Z) - NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth
Supervision for Indoor Multi-View 3D Detection [72.0098999512727]
NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by utilizing NeRF to enhance representation learning.
We present three corresponding solutions, including semantic enhancement, perspective-aware sampling, and ordinal depth supervision.
The resulting algorithm, NeRF-Det++, has exhibited appealing performance on the ScanNetV2 and ARKitScenes datasets.
arXiv Detail & Related papers (2024-02-22T11:48:06Z) - ARAI-MVSNet: A multi-view stereo depth estimation network with adaptive
depth range and depth interval [19.28042366225802]
Multi-View Stereo (MVS) is a fundamental problem in geometric computer vision.
We present a novel multi-stage coarse-to-fine framework to achieve adaptive all-pixel depth range and depth interval.
Our model achieves state-of-the-art performance and yields competitive generalization ability.
arXiv Detail & Related papers (2023-08-17T14:52:11Z) - High-Resolution Synthetic RGB-D Datasets for Monocular Depth Estimation [3.349875948009985]
We generate a high-resolution synthetic depth dataset (HRSD) of resolution 1920×1080 from Grand Theft Auto V (GTA-V), which contains 100,000 color images and corresponding dense ground-truth depth maps.
For experiments and analysis, we train the DPT algorithm, a state-of-the-art transformer-based MDE algorithm, on the proposed synthetic dataset, which significantly increases the accuracy of depth maps on different scenes by 9%.
arXiv Detail & Related papers (2023-05-02T19:03:08Z) - RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View
Stereo [35.22032072756035]
RayMVSNet learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth.
Our method ranks first on both the DTU and the Tanks & Temples datasets among all previous learning-based methods.
arXiv Detail & Related papers (2022-04-04T08:43:38Z) - 3DVNet: Multi-View Depth Prediction and Volumetric Refinement [68.68537312256144]
3DVNet is a novel multi-view stereo (MVS) depth-prediction method.
Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions.
We show that our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics.
arXiv Detail & Related papers (2021-12-01T00:52:42Z) - Virtual Normal: Enforcing Geometric Constraints for Accurate and Robust
Depth Prediction [87.08227378010874]
We show the importance of the high-order 3D geometric constraints for depth prediction.
By designing a loss term that enforces a simple geometric constraint, we significantly improve the accuracy and robustness of monocular depth estimation.
We show state-of-the-art results of learning metric depth on NYU Depth-V2 and KITTI.
arXiv Detail & Related papers (2021-03-07T00:08:21Z) - OmniSLAM: Omnidirectional Localization and Dense Mapping for
Wide-baseline Multi-camera Systems [88.41004332322788]
We present an omnidirectional localization and dense mapping system for a wide-baseline multiview stereo setup with ultra-wide field-of-view (FOV) fisheye cameras.
For more practical and accurate reconstruction, we first introduce improved and lightweight deep neural networks for omnidirectional depth estimation.
We integrate our omnidirectional depth estimates into the visual odometry (VO) and add a loop closing module for global consistency.
arXiv Detail & Related papers (2020-03-18T05:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.