MGDepth: Motion-Guided Cost Volume For Self-Supervised Monocular Depth In Dynamic Scenarios
- URL: http://arxiv.org/abs/2312.15268v1
- Date: Sat, 23 Dec 2023 14:36:27 GMT
- Title: MGDepth: Motion-Guided Cost Volume For Self-Supervised Monocular Depth In Dynamic Scenarios
- Authors: Kaichen Zhou, Jia-Xing Zhong, Jia-Wang Bian, Qian Xie, Jian-Qing Zheng, Niki Trigoni, Andrew Markham
- Abstract summary: MGDepth is a Motion-Guided Cost Volume Depth Net that achieves precise depth estimation for both dynamic objects and static backgrounds.
MGDepth achieves a significant reduction of approximately seven percent in root-mean-square error for self-supervised monocular depth estimation on the KITTI-2015 dataset.
- Score: 47.33082977365344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite advancements in self-supervised monocular depth estimation,
challenges persist in dynamic scenarios due to the dependence on assumptions
about a static world. In this paper, we present MGDepth, a Motion-Guided Cost
Volume Depth Net, to achieve precise depth estimation for both dynamic objects
and static backgrounds, all while maintaining computational efficiency. To
tackle the challenges posed by dynamic content, we incorporate optical flow and
coarse monocular depth to create a novel static reference frame. This frame is
then utilized to build a motion-guided cost volume in collaboration with the
target frame. Additionally, to enhance the accuracy and resilience of the
network structure, we introduce an attention-based depth net architecture to
effectively integrate information from feature maps with varying resolutions.
Compared to methods with similar computational costs, MGDepth achieves a
significant reduction of approximately seven percent in root-mean-square error
for self-supervised monocular depth estimation on the KITTI-2015 dataset.
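
To make the method concrete, here is a minimal PyTorch sketch of one plausible reading of the abstract: coarse monocular depth and the relative camera pose give the camera-induced (rigid) flow; subtracting it from the observed optical flow isolates object motion; warping the source frame by that residual yields an approximately static reference frame; and a plane-sweep cost volume is then built between that reference and the target frame. Every function name, tensor convention, and the residual-flow construction itself are illustrative assumptions, not the paper's actual implementation.

# Illustrative sketch, not the MGDepth implementation. Assumed conventions:
# flows in pixel units with shape (b, 2, h, w), intrinsics K/K_inv as (3, 3),
# and T_ts as the (b, 4, 4) target-to-source camera pose.
import torch
import torch.nn.functional as F

def pixel_grid(b, h, w, device):
    # Homogeneous pixel coordinates (x, y, 1), shape (b, 3, h*w).
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32, device=device),
        torch.arange(w, dtype=torch.float32, device=device),
        indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], 0)
    return grid.reshape(1, 3, -1).expand(b, -1, -1)

def rigid_flow(depth, K, K_inv, T):
    # Camera-induced flow: back-project with depth, move by pose T, re-project.
    b, _, h, w = depth.shape
    pix = pixel_grid(b, h, w, depth.device)
    pts = (K_inv @ pix) * depth.reshape(b, 1, -1)             # 3D points
    pts = torch.cat([pts, torch.ones(b, 1, h * w, device=depth.device)], 1)
    cam = (T @ pts)[:, :3]                                    # transformed points
    uv = K @ cam
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)               # perspective divide
    return (uv - pix[:, :2]).reshape(b, 2, h, w)

def backward_warp(img, flow):
    # Sample img at (pixel + flow); flow is in pixel units.
    b, _, h, w = img.shape
    pix = pixel_grid(b, h, w, img.device)[:, :2].reshape(b, 2, h, w)
    tgt = pix + flow
    grid = torch.stack([2.0 * tgt[:, 0] / (w - 1) - 1.0,
                        2.0 * tgt[:, 1] / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(img, grid, padding_mode="border", align_corners=True)

def static_reference(frame_s, flow_total, depth_coarse, K, K_inv, T_ts):
    # Residual (object) flow = observed flow minus camera-induced flow.
    # Warping the source by it approximately undoes object motion, giving a
    # frame that satisfies the static-world assumption (small-flow approximation).
    residual = flow_total - rigid_flow(depth_coarse, K, K_inv, T_ts)
    return backward_warp(frame_s, residual)

def motion_guided_cost_volume(feat_t, feat_ref, K, K_inv, T_ts, depth_bins):
    # Plane sweep: warp the static-reference features at each hypothesized
    # depth and correlate with the target features.
    b, c, h, w = feat_t.shape
    slices = []
    for d in depth_bins:
        hypo = torch.full((b, 1, h, w), float(d), device=feat_t.device)
        warped = backward_warp(feat_ref, rigid_flow(hypo, K, K_inv, T_ts))
        slices.append((feat_t * warped).mean(dim=1))          # correlation score
    return torch.stack(slices, dim=1)                         # (b, D, h, w)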
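
The abstract also mentions an attention-based depth network that integrates feature maps of varying resolutions. The sketch below shows one generic way to fuse multi-resolution features with standard multi-head attention; the class name, the use of the finest scale as queries, and the assumption that all scales share a channel count are mine, not the paper's design.

# Illustrative sketch: queries come from the finest feature map, keys/values
# from every scale upsampled to the finest resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttentionFusion(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats):
        # feats: list of (b, c, h_i, w_i) tensors, finest scale first;
        # assumes every scale has the same channel count c.
        b, c, h, w = feats[0].shape
        up = [F.interpolate(f, size=(h, w), mode="bilinear",
                            align_corners=False) for f in feats]
        tokens = [f.flatten(2).transpose(1, 2) for f in up]   # (b, h*w, c)
        q = tokens[0]
        kv = torch.cat(tokens, dim=1)                         # all scales as context
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)                          # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Example: fuse two scales of 64-channel features.
fusion = CrossScaleAttentionFusion(channels=64)
feats = [torch.randn(1, 64, 24, 80), torch.randn(1, 64, 12, 40)]
out = fusion(feats)                                           # (1, 64, 24, 80)

Upsampling everything to the finest resolution keeps the fusion simple; a real decoder would likely attend at several scales to control the quadratic cost of full attention.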
Related papers
- D$^3$epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes [23.731667977542454]
D$^3$epth is a novel method for self-supervised depth estimation in dynamic scenes.
It tackles the challenge of dynamic objects from two key perspectives.
It consistently outperforms existing self-supervised monocular depth estimation baselines.
arXiv Detail & Related papers (2024-11-07T16:07:00Z)
- Self-supervised Monocular Depth Estimation with Large Kernel Attention [30.44895226042849]
We propose a self-supervised monocular depth estimation network that recovers finer details.
Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies.
Our method achieves competitive results on the KITTI dataset.
arXiv Detail & Related papers (2024-09-26T14:44:41Z)
- Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation [23.93080319283679]
Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss.
Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation.
This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data.
arXiv Detail & Related papers (2024-04-23T10:51:15Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-DepthFormer framework, which jointly predicts scene depth and the 3D motion field.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers to obtain an enhanced depth feature representation.
Second, we propose a warping-based Motion Network that estimates the motion field of dynamic objects without using semantic priors.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption for training, which is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating single-image depth prior.
Our model can predict sharp and accurate depth maps, even when trained on monocular videos of highly dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Multi-Camera Collaborative Depth Prediction via Consistent Structure Estimation [75.99435808648784]
We propose a novel multi-camera collaborative depth prediction method.
It does not require large overlapping areas while maintaining structure consistency between cameras.
Experimental results on DDAD and NuScenes datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2022-10-05T03:44:34Z)
- Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework that exploits monocular depth estimation to improve visual odometry (VO).
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z)
- Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection [86.25022248968908]
We learn context- and depth-aware feature representations to solve the problem of monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z)
- Robust Consistent Video Depth Estimation [65.53308117778361]
We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video.
Our algorithm combines two complementary techniques: (1) flexible deformation-splines for low-frequency large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details.
In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures containing a significant amount of noise, shake, motion blur, and rolling shutter deformations.
arXiv Detail & Related papers (2020-12-10T18:59:48Z)
- Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues [24.743099160992937]
We propose a novel self-supervised joint learning framework for depth estimation.
The proposed framework outperforms the state-of-the-art (SOTA) on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2020-06-17T13:56:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.