Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction
- URL: http://arxiv.org/abs/2004.10681v3
- Date: Fri, 7 Aug 2020 05:15:31 GMT
- Title: Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction
- Authors: Lokender Tiwari, Pan Ji, Quoc-Huy Tran, Bingbing Zhuang, Saket Anand,
Manmohan Chandraker
- Abstract summary: CNNs for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment.
We propose a joint narrow and wide baseline based self-improving framework, where on the one hand the CNN-predicted depth is leveraged to perform pseudo RGB-D feature-based SLAM.
On the other hand, the bundle-adjusted 3D scene structures and camera poses from the more principled geometric SLAM are injected back into the depth network through novel wide baseline losses.
- Score: 72.30870535815258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical monocular Simultaneous Localization And Mapping (SLAM) and the
recently emerging convolutional neural networks (CNNs) for monocular depth
prediction represent two largely disjoint approaches towards building a 3D map
of the surrounding environment. In this paper, we demonstrate that the coupling
of these two by leveraging the strengths of each mitigates the other's
shortcomings. Specifically, we propose a joint narrow and wide baseline based
self-improving framework, where on the one hand the CNN-predicted depth is
leveraged to perform pseudo RGB-D feature-based SLAM, leading to better
accuracy and robustness than the monocular RGB SLAM baseline. On the other
hand, the bundle-adjusted 3D scene structures and camera poses from the more
principled geometric SLAM are injected back into the depth network through
novel wide baseline losses proposed for improving the depth prediction network,
which then continues to contribute towards better pose and 3D structure
estimation in the next iteration. We emphasize that our framework only requires
unlabeled monocular videos in both training and inference stages, and yet is
able to outperform state-of-the-art self-supervised monocular and stereo depth
prediction networks (e.g, Monodepth2) and feature-based monocular SLAM system
(i.e, ORB-SLAM). Extensive experiments on KITTI and TUM RGB-D datasets verify
the superiority of our self-improving geometry-CNN framework.
Related papers
- Double-Shot 3D Shape Measurement with a Dual-Branch Network [14.749887303860717]
We propose a dual-branch Convolutional Neural Network (CNN)-Transformer network (PDCNet) to process different structured light (SL) modalities.
Within PDCNet, a Transformer branch is used to capture global perception in the fringe images, while a CNN branch is designed to collect local details in the speckle images.
We show that our method can reduce fringe order ambiguity while producing high-accuracy results on a self-made dataset.
arXiv Detail & Related papers (2024-07-19T10:49:26Z) - MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface
Reconstruction [72.05649682685197]
State-of-the-art neural implicit methods allow for high-quality reconstructions of simple scenes from many input views.
This is caused primarily by the inherent ambiguity in the RGB reconstruction loss that does not provide enough constraints.
Motivated by recent advances in the area of monocular geometry prediction, we explore the utility these cues provide for improving neural implicit surface reconstruction.
arXiv Detail & Related papers (2022-06-01T17:58:15Z) - NVS-MonoDepth: Improving Monocular Depth Prediction with Novel View
Synthesis [74.4983052902396]
We propose a novel training method split in three main steps to improve monocular depth estimation.
Experimental results prove that our method achieves state-of-the-art or comparable performance on the KITTI and NYU-Depth-v2 datasets.
arXiv Detail & Related papers (2021-12-22T12:21:08Z) - 3DVNet: Multi-View Depth Prediction and Volumetric Refinement [68.68537312256144]
3DVNet is a novel multi-view stereo (MVS) depth-prediction method.
Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions.
We show that our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics.
arXiv Detail & Related papers (2021-12-01T00:52:42Z) - Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are
Better Than One [32.01675089157679]
We propose to integrate the strengths of multiple weak depth predictor to build a comprehensive and accurate depth predictor.
Specifically, we construct multiple base (weak) depth predictors by utilizing different Transformer-based and convolutional neural network (CNN)-based architectures.
The resultant model, which we refer to as Transformer-assisted depth ensembles (TEDepth), achieves better results than previous state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-16T09:09:05Z) - TANDEM: Tracking and Dense Mapping in Real-time using Deep Multi-view
Stereo [55.30992853477754]
We present TANDEM, a real-time monocular tracking and dense framework.
For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of alignments.
TANDEM shows state-of-the-art real-time 3D reconstruction performance.
arXiv Detail & Related papers (2021-11-14T19:01:02Z) - Monocular Depth Estimation Primed by Salient Point Detection and
Normalized Hessian Loss [43.950140695759764]
We propose an accurate and lightweight framework for monocular depth estimation based on a self-attention mechanism stemming from salient point detection.
We introduce a normalized Hessian loss term invariant to scaling and shear along the depth direction, which is shown to substantially improve the accuracy.
The proposed method achieves state-of-the-art results on NYU-Depth-v2 and KITTI while using 3.1-38.4 times smaller model in terms of the number of parameters than baseline approaches.
arXiv Detail & Related papers (2021-08-25T07:51:09Z) - Self-supervised Depth Estimation Leveraging Global Perception and
Geometric Smoothness Using On-board Videos [0.5276232626689566]
We present DLNet for pixel-wise depth estimation, which simultaneously extracts global and local features.
A three-dimensional geometry smoothness loss is proposed to predict a geometrically natural depth map.
In experiments on the KITTI and Make3D benchmarks, the proposed DLNet achieves performance competitive to those of the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-07T10:53:27Z) - Depth-conditioned Dynamic Message Propagation for Monocular 3D Object
Detection [86.25022248968908]
We learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.