MonoIndoor++: Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments
- URL: http://arxiv.org/abs/2207.08951v1
- Date: Mon, 18 Jul 2022 21:34:43 GMT
- Title: MonoIndoor++: Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments
- Authors: Runze Li, Pan Ji, Yi Xu, Bir Bhanu
- Abstract summary: Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments.
However, depth prediction results are not satisfactory in indoor scenes, where most of the existing data are captured with hand-held devices.
We propose a novel framework, MonoIndoor++, to improve the performance of self-supervised monocular depth estimation for indoor environments.
- Score: 45.89629401768049
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised monocular depth estimation has seen significant progress in
recent years, especially in outdoor environments. However, depth prediction
results are not satisfactory in indoor scenes, where most of the existing data are
captured with hand-held devices. Compared to outdoor environments,
estimating depth of monocular videos for indoor environments, using
self-supervised methods, results in two additional challenges: (i) the depth
range of indoor video sequences varies significantly across frames, making it
difficult for the depth network to induce consistent depth cues for training;
(ii) the indoor sequences recorded with hand-held devices often contain
substantially more rotational motion, which makes it difficult for the pose network to
predict accurate relative camera poses. In this work, we propose a novel
framework, MonoIndoor++, by giving special consideration to these challenges and
consolidating a set of good practices for improving the performance of
self-supervised monocular depth estimation for indoor environments. First, a
depth factorization module with a transformer-based scale regression network is
proposed to estimate a global depth scale factor explicitly, and the predicted
scale factor can indicate the maximum depth values. Second, rather than using a
single-stage pose estimation strategy as in previous methods, we propose to
utilize a residual pose estimation module to estimate relative camera poses
across consecutive frames iteratively. Third, to incorporate extensive
coordinate guidance for our residual pose estimation module, we propose to
perform coordinate convolutional encoding directly over the inputs to pose
networks. The proposed method is validated on a variety of benchmark indoor
datasets, i.e., EuRoC MAV, NYUv2, ScanNet, and 7-Scenes, demonstrating
state-of-the-art performance.
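The three components above can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the authors' implementation: the module sizes, the tiny transformer standing in for the scale regression network, the toy pose CNN, and the omission of the inter-iteration view warping are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def add_coord_channels(x):
    """Coordinate convolutional encoding: append normalized (x, y)
    coordinate maps as extra input channels (CoordConv-style)."""
    b, _, h, w = x.shape
    ys = torch.linspace(-1.0, 1.0, h, device=x.device)
    xs = torch.linspace(-1.0, 1.0, w, device=x.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([gx, gy]).expand(b, -1, -1, -1)
    return torch.cat([x, coords], dim=1)


class ScaleRegressor(nn.Module):
    """Stand-in for the transformer-based scale regression network:
    pools encoder features with a small transformer and regresses one
    positive global scale factor per image."""
    def __init__(self, feat_dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, feats):                      # feats: (B, C, h, w)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, h*w, C)
        pooled = self.encoder(tokens).mean(dim=1)
        return F.softplus(self.head(pooled))       # (B, 1), scale > 0


class PoseNet(nn.Module):
    """Tiny CNN predicting a 6-DoF pose (axis-angle + translation)."""
    def __init__(self, in_ch=8):                   # 2 RGB frames + 2 coord maps
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6))

    def forward(self, x):
        return self.net(x)


def pose_vec_to_mat(vec):
    """Convert (B, 6) axis-angle + translation vectors to 4x4 homogeneous
    transforms via the Rodrigues formula."""
    b = vec.shape[0]
    rot, trans = vec[:, :3], vec[:, 3:]
    theta = rot.norm(dim=1, keepdim=True).clamp(min=1e-8)
    k = rot / theta
    x, y, z = k[:, 0], k[:, 1], k[:, 2]
    zero = torch.zeros_like(x)
    K = torch.stack([zero, -z, y, z, zero, -x, -y, x, zero],
                    dim=1).view(b, 3, 3)
    eye = torch.eye(3, device=vec.device).expand(b, 3, 3)
    s, c = torch.sin(theta).view(b, 1, 1), torch.cos(theta).view(b, 1, 1)
    R = eye + s * K + (1 - c) * (K @ K)
    T = torch.eye(4, device=vec.device).repeat(b, 1, 1)
    T[:, :3, :3] = R
    T[:, :3, 3] = trans
    return T


def estimate_pose(pose_net, img_t, img_s, n_iters=3):
    """Residual pose estimation: compose a small residual transform on top
    of the running estimate for a few iterations. (The paper warps the
    source frame with the current pose between iterations; that
    view-synthesis step is omitted here for brevity.)"""
    total = torch.eye(4, device=img_t.device).repeat(img_t.shape[0], 1, 1)
    for _ in range(n_iters):
        inp = add_coord_channels(torch.cat([img_t, img_s], dim=1))
        total = pose_vec_to_mat(pose_net(inp)) @ total
    return total


if __name__ == "__main__":
    img_t, img_s = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    rel_depth = torch.sigmoid(torch.randn(2, 1, 64, 64))  # fake relative depth in (0, 1)
    feats = torch.randn(2, 64, 16, 16)                    # fake encoder bottleneck
    scale = ScaleRegressor()(feats)                       # global scale per image
    metric_depth = scale.view(-1, 1, 1, 1) * rel_depth    # depth factorization
    pose = estimate_pose(PoseNet(), img_t, img_s)
    print(metric_depth.shape, pose.shape)                 # (2,1,64,64), (2,4,4)
```

Under these assumptions, the factorized output `scale * rel_depth` is the metric depth that the photometric self-supervision would train end to end, and the pose used for view synthesis is the composition of the residual estimates.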
Related papers
- ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation [62.600382533322325]
We propose a novel monocular depth estimation method called ScaleDepth.
Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction module.
Our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework.
arXiv Detail & Related papers (2024-07-11T05:11:56Z)
- SCIPaD: Incorporating Spatial Clues into Unsupervised Pose-Depth Joint Learning [17.99904937160487]
We introduce SCIPaD, a novel approach that incorporates spatial clues for unsupervised depth-pose joint learning.
SCIPaD achieves a reduction of 22.2% in average translation error and 34.8% in average angular error for the camera pose estimation task on the KITTI Odometry dataset.
arXiv Detail & Related papers (2024-07-07T06:52:51Z)
- GEDepth: Ground Embedding for Monocular Depth Estimation [4.95394574147086]
This paper proposes a novel ground embedding module to decouple camera parameters from pictorial cues.
A ground attention mechanism is designed in the module to optimally combine ground depth with residual depth.
Experiments reveal that our approach achieves state-of-the-art results on popular benchmarks.
arXiv Detail & Related papers (2023-09-18T17:56:06Z)
- Multi-Camera Collaborative Depth Prediction via Consistent Structure Estimation [75.99435808648784]
We propose a novel multi-camera collaborative depth prediction method.
It does not require large overlapping areas while maintaining structure consistency between cameras.
Experimental results on DDAD and NuScenes datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2022-10-05T03:44:34Z)
- Uncertainty Guided Depth Fusion for Spike Camera [49.41822923588663]
We propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse predictions of monocular and stereo depth estimation networks for spike camera.
Our framework is motivated by the fact that stereo spike depth estimation achieves better results at close range.
In order to demonstrate the advantage of spike depth estimation over traditional camera depth estimation, we contribute a spike-depth dataset named CitySpike20K. (A generic sketch of uncertainty-weighted fusion is given after this list.)
arXiv Detail & Related papers (2022-08-26T13:04:01Z)
- MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments [55.05401912853467]
Self-supervised depth estimation for indoor environments is more challenging than its outdoor counterpart.
The depth range of indoor sequences varies significantly across frames, making it difficult for the depth network to induce consistent depth cues.
The maximum distance in outdoor scenes mostly stays the same as the camera usually sees the sky.
The motions of outdoor sequences are predominantly translational, especially for driving datasets such as KITTI.
arXiv Detail & Related papers (2021-07-26T18:45:14Z)
- Robust Consistent Video Depth Estimation [65.53308117778361]
We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video.
Our algorithm combines two complementary techniques: (1) flexible deformation splines for low-frequency large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details.
In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures containing a significant amount of noise, shake, motion blur, and rolling shutter deformations.
arXiv Detail & Related papers (2020-12-10T18:59:48Z)
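The uncertainty-guided fusion described in the UGDF entry above can be illustrated with a generic inverse-uncertainty weighting; the softmax weighting and all variable names below are illustrative assumptions rather than UGDF's actual scheme.

```python
import torch

def uncertainty_guided_fusion(d_mono, s_mono, d_stereo, s_stereo):
    """Generic uncertainty-weighted fusion of two depth predictions:
    pixels where a branch is more certain (lower sigma) receive a larger
    weight. An approximation of the idea, not a reproduction of UGDF."""
    w = torch.softmax(torch.stack([-s_mono, -s_stereo]), dim=0)
    return w[0] * d_mono + w[1] * d_stereo

# Toy usage: stereo is assumed more reliable at close range, so it is given
# lower uncertainty there and the fusion prefers it on those pixels.
d_mono, d_stereo = torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32)
s_mono = torch.full((1, 1, 32, 32), 0.8)
s_stereo = torch.where(d_stereo < 0.5, torch.tensor(0.2), torch.tensor(1.0))
fused = uncertainty_guided_fusion(d_mono, s_mono, d_stereo, s_stereo)
print(fused.shape)  # torch.Size([1, 1, 32, 32])
```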