D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual
Odometry
- URL: http://arxiv.org/abs/2003.01060v2
- Date: Sat, 28 Mar 2020 21:08:41 GMT
- Title: D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual
Odometry
- Authors: Nan Yang and Lukas von Stumberg and Rui Wang and Daniel Cremers
- Abstract summary: D3VO is a novel framework for monocular visual odometry that exploits deep networks on three levels -- deep depth, pose and uncertainty estimation.
We first propose a novel self-supervised monocular depth estimation network trained on stereo videos without any external supervision.
We model the photometric uncertainties of pixels on the input images, which improves the depth estimation accuracy.
- Score: 57.5549733585324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose D3VO as a novel framework for monocular visual odometry that
exploits deep networks on three levels -- deep depth, pose and uncertainty
estimation. We first propose a novel self-supervised monocular depth estimation
network trained on stereo videos without any external supervision. In
particular, it aligns the training image pairs to similar lighting conditions
with predictive brightness transformation parameters. In addition, we model the
photometric uncertainties of pixels on the input images, which improves the
depth estimation accuracy and provides a learned weighting function for the
photometric residuals in direct (feature-less) visual odometry. Evaluation
results show that the proposed network outperforms state-of-the-art
self-supervised depth estimation networks. D3VO tightly incorporates the
predicted depth, pose and uncertainty into a direct visual odometry method to
boost both the front-end tracking as well as the back-end non-linear
optimization. We evaluate D3VO in terms of monocular visual odometry on both
the KITTI odometry benchmark and the EuRoC MAV dataset. The results show that
D3VO outperforms state-of-the-art traditional monocular VO methods by a large
margin. It also achieves comparable results to state-of-the-art stereo/LiDAR
odometry on KITTI and to the state-of-the-art visual-inertial odometry on EuRoC
MAV, while using only a single camera.
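To make the brightness alignment and the learned weighting concrete, here is a minimal PyTorch sketch of an uncertainty-weighted photometric loss in the spirit of the abstract. The function name and tensor arguments are illustrative, the residual is reduced to a plain L1 term (the full method also uses a structural similarity term), and the Laplacian-style sigma weighting is an assumption based on common self-supervised depth formulations rather than the paper's exact loss:

```python
import torch

def photometric_loss(i_target, i_warped, a, b, sigma, eps=1e-6):
    # Align the warped source image to the target frame's lighting using the
    # predicted affine brightness parameters (I' = a * I + b).
    i_aligned = a * i_warped + b
    # Simplified L1 photometric residual between target and aligned source.
    residual = torch.abs(i_target - i_aligned)
    # Down-weight pixels with high predicted photometric uncertainty; the
    # log term keeps the network from inflating sigma everywhere.
    weighted = residual / (sigma + eps) + torch.log(sigma + eps)
    return weighted.mean()
```

The same predicted per-pixel sigma map can then serve as the learned weighting function for photometric residuals in the direct VO energy, which is how the abstract describes coupling the network's uncertainty to both front-end tracking and back-end optimization.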
Related papers
- Self-supervised Monocular Depth Estimation on Water Scenes via Specular Reflection Prior [3.2120448116996103]
This paper proposes the first self-supervised approach to deep-learning depth estimation on water scenes, based on intra-frame priors.
In the first stage, a water segmentation network is applied to separate the reflection components from the entire image.
The photometric re-projection error, incorporating SmoothL1 and a novel photometric adaptive SSIM, is formulated to optimize pose and depth estimation.
arXiv Detail & Related papers (2024-04-10T17:25:42Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D Object Detection [95.8940731298518]
We propose a novel Geometry Uncertainty Propagation Network (GUPNet++).
It models how uncertainty propagates through the geometry projection during training, improving the stability and efficiency of end-to-end model learning.
Experiments show that the proposed approach not only achieves state-of-the-art (SOTA) performance in image-based monocular 3D detection but also demonstrates superior efficacy with a simplified framework.
arXiv Detail & Related papers (2023-10-24T08:45:15Z)
- Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking [47.59619420444781]
Approaches to monocular 3D perception, including detection and tracking, often yield inferior performance compared to LiDAR-based techniques.
We propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation.
arXiv Detail & Related papers (2022-06-08T03:37:59Z)
- Neural Radiance Fields Approach to Deep Multi-View Photometric Stereo [103.08512487830669]
We present a modern solution to the multi-view photometric stereo (MVPS) problem.
We procure the surface orientation using a photometric stereo (PS) image formation model and blend it with a multi-view neural radiance field representation to recover the object's surface geometry.
Our method performs neural rendering of multi-view images while utilizing surface normals estimated by a deep photometric stereo network.
arXiv Detail & Related papers (2021-10-11T20:20:03Z)
- Scale-aware direct monocular odometry [4.111899441919165]
We present a framework for direct monocular odometry based on depth prediction from a deep neural network.
Our approach largely outperforms classic monocular SLAM, being 5 to 9 times more precise, with accuracy closer to that of stereo systems.
arXiv Detail & Related papers (2021-09-21T10:30:15Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Our method notably improves the detection performance of the state-of-the-art monocular method by 2.80% on the moderate test setting, without using extra data.
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
- Deep Two-View Structure-from-Motion Revisited [83.93809929963969]
Two-view structure-from-motion (SfM) is the cornerstone of 3D reconstruction and visual SLAM.
We propose to revisit the problem of deep two-view SfM by leveraging the well-posedness of the classic pipeline.
Our method consists of 1) an optical flow estimation network that predicts dense correspondences between two frames; 2) a normalized pose estimation module that computes relative camera poses from the 2D optical flow correspondences; and 3) a scale-invariant depth estimation network that leverages epipolar geometry to reduce the search space, refine the dense correspondences, and estimate relative depth maps (a generic sketch of this classic pipeline follows the list).
arXiv Detail & Related papers (2021-04-01T15:31:20Z)
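As a side note on the Deep Two-View Structure-from-Motion entry, the classic well-posed two-view pipeline it revisits can be sketched in a few lines with OpenCV. This is a generic illustration, not the paper's code: `pts1` and `pts2` stand in for the dense correspondences predicted by the optical-flow network (Nx2 float arrays), and `K` is the camera intrinsic matrix.

```python
import cv2
import numpy as np

def two_view_sfm(pts1, pts2, K):
    # Step 2 of the pipeline: relative pose from 2D-2D correspondences,
    # with RANSAC rejecting outlier matches.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    # Step 3: triangulate 3D points from the two camera projection matrices.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # 4xN homogeneous
    return R, t, (X[:3] / X[3]).T
```

Because the recovered translation has unit norm, two-view depth is only defined up to scale, which is why the paper's third stage is described as a scale-invariant depth estimation network.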