D$^3$epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes
- URL: http://arxiv.org/abs/2411.04826v1
- Date: Thu, 07 Nov 2024 16:07:00 GMT
- Title: D$^3$epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes
- Authors: Siyu Chen, Hong Liu, Wenhao Li, Ying Zhu, Guoquan Wang, Jianbing Wu,
- Abstract summary: D$^3$epth is a novel method for self-supervised depth estimation in dynamic scenes.
It tackles the challenge of dynamic objects from two key perspectives.
It consistently outperforms existing self-supervised monocular depth estimation baselines.
- Score: 23.731667977542454
- License:
- Abstract: Depth estimation is a crucial technology in robotics. Recently, self-supervised depth estimation methods have demonstrated great potential as they can efficiently leverage large amounts of unlabelled real-world data. However, most existing methods are designed under the assumption of static scenes, which hinders their adaptability in dynamic environments. To address this issue, we present D$^3$epth, a novel method for self-supervised depth estimation in dynamic scenes. It tackles the challenge of dynamic objects from two key perspectives. First, within the self-supervised framework, we design a reprojection constraint to identify regions likely to contain dynamic objects, allowing the construction of a dynamic mask that mitigates their impact at the loss level. Second, for multi-frame depth estimation, we introduce a cost volume auto-masking strategy that leverages adjacent frames to identify regions associated with dynamic objects and generate corresponding masks. This provides guidance for subsequent processes. Furthermore, we propose a spectral entropy uncertainty module that incorporates spectral entropy to guide uncertainty estimation during depth fusion, effectively addressing issues arising from cost volume computation in dynamic environments. Extensive experiments on KITTI and Cityscapes datasets demonstrate that the proposed method consistently outperforms existing self-supervised monocular depth estimation baselines. Code is available at \url{https://github.com/Csyunling/D3epth}.
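The two masking ideas in the abstract can be illustrated with a small sketch. Assuming a per-pixel photometric reprojection error map and a matching cost volume are available as NumPy arrays, a thresholded dynamic mask and an entropy-based uncertainty map might look like the following; the quantile threshold and the use of Shannon entropy over the depth-hypothesis axis are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def dynamic_mask_from_reprojection(err, quantile=0.9):
    """Mask pixels whose photometric reprojection error exceeds a
    per-image quantile; such pixels are likely to belong to dynamic
    objects and can be down-weighted at the loss level."""
    thresh = np.quantile(err, quantile)
    return err > thresh  # True where the pixel is treated as dynamic

def entropy_uncertainty(cost_volume, axis=0, eps=1e-8):
    """Shannon entropy of the softmax-normalized matching costs along
    the depth-hypothesis axis; high entropy means the cost volume is
    ambiguous there, e.g. around moving objects."""
    c = cost_volume - cost_volume.max(axis=axis, keepdims=True)
    p = np.exp(c)
    p /= p.sum(axis=axis, keepdims=True) + eps
    return -(p * np.log(p + eps)).sum(axis=axis)
```

A flat (uninformative) cost volume yields entropy close to log(D) for D depth hypotheses, while a sharply peaked one yields entropy near zero, so the map can serve as a per-pixel uncertainty weight during depth fusion.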
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z) - Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation [23.93080319283679]
Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss.
Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation.
This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data.
arXiv Detail & Related papers (2024-04-23T10:51:15Z) - GAM-Depth: Self-Supervised Indoor Depth Estimation Leveraging a Gradient-Aware Mask and Semantic Constraints [12.426365333096264]
We propose GAM-Depth, developed upon two novel components: gradient-aware mask and semantic constraints.
The gradient-aware mask enables adaptive and robust supervision for both key areas and textureless regions.
The incorporation of semantic constraints for indoor self-supervised depth estimation improves depth discrepancies at object boundaries.
arXiv Detail & Related papers (2024-02-22T07:53:34Z) - Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes [45.092076587934464]
We present Manydepth2, which achieves precise depth estimation for both dynamic objects and static backgrounds.
To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a pseudo-static reference frame.
This frame is then utilized to build a motion-aware cost volume in collaboration with the vanilla target frame.
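The pseudo-static reference frame described above can be approximated by backward-warping the reference image with optical flow so that moving content aligns with the target frame. The sketch below is a hypothetical, simplified version using nearest-neighbour sampling; Manydepth2's actual construction also incorporates coarse monocular depth:

```python
import numpy as np

def pseudo_static_frame(ref, flow):
    """Backward-warp the reference frame with per-pixel optical flow
    (flow[..., 0] = horizontal, flow[..., 1] = vertical displacement),
    using nearest-neighbour sampling for brevity."""
    H, W = ref.shape[:2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    us = np.clip((u + flow[..., 0]).round().astype(int), 0, W - 1)
    vs = np.clip((v + flow[..., 1]).round().astype(int), 0, H - 1)
    return ref[vs, us]
```

With the object motion compensated this way, the warped frame can stand in for a static reference when building the cost volume.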
arXiv Detail & Related papers (2023-12-23T14:36:27Z) - Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks.
Experiments on real-world datasets demonstrate the effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z) - SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
It relies on the multi-view consistency assumption for training networks; however, this assumption is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model for generating single-image depth prior.
Our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z) - D2SLAM: Semantic visual SLAM based on the influence of Depth for Dynamic environments [0.483420384410068]
We propose a novel approach for determining dynamic elements, addressing the lack of generalization and scene awareness in existing methods.
We use scene depth information that refines the accuracy of estimates from geometric and semantic modules.
The obtained results demonstrate the efficacy of the proposed method in providing accurate localization and mapping in dynamic environments.
arXiv Detail & Related papers (2022-10-16T22:13:59Z) - Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection [86.25022248968908]
We learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z) - DOT: Dynamic Object Tracking for Visual SLAM [83.69544718120167]
DOT combines instance segmentation and multi-view geometry to generate masks for dynamic objects.
To determine which objects are actually moving, DOT first segments instances of potentially dynamic objects and then, using the estimated camera motion, tracks them by minimizing the photometric reprojection error.
Our results show that our approach significantly improves the accuracy and robustness of ORB-SLAM2, especially in highly dynamic scenes.
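DOT's core quantity, the photometric reprojection error under the estimated camera motion, can be sketched as follows. Assuming known intrinsics K, a per-pixel depth map, and a 4x4 relative pose T_ts (all hypothetical inputs), a nearest-neighbour warping version is shown below; real implementations typically use differentiable bilinear sampling:

```python
import numpy as np

def photometric_reprojection_error(I_t, I_s, depth, K, T_ts):
    """L1 photometric error after warping the source image I_s into the
    target view using the target depth, intrinsics K, and the relative
    pose T_ts (source-from-target)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    # back-project target pixels to 3-D, then move them to the source frame
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = T_ts[:3, :3] @ pts + T_ts[:3, 3:4]
    # project into the source image and sample (nearest neighbour)
    proj = K @ pts
    us = (proj[0] / proj[2]).round().astype(int).clip(0, W - 1)
    vs = (proj[1] / proj[2]).round().astype(int).clip(0, H - 1)
    warped = I_s[vs, us].reshape(H, W)
    return np.abs(I_t - warped)  # per-pixel L1 error
```

Pixels whose error stays high under the camera-motion-only warp are the ones a tracker like DOT flags as independently moving.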
arXiv Detail & Related papers (2020-09-30T18:36:28Z) - Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues [24.743099160992937]
We propose a novel self-supervised joint learning framework for depth estimation.
The proposed framework outperforms the state-of-the-art (SOTA) on KITTI and Make3D datasets.
arXiv Detail & Related papers (2020-06-17T13:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.