Exploring the Mutual Influence between Self-Supervised Single-Frame and Multi-Frame Depth Estimation
- URL: http://arxiv.org/abs/2304.12685v2
- Date: Mon, 28 Aug 2023 02:23:05 GMT
- Title: Exploring the Mutual Influence between Self-Supervised Single-Frame and Multi-Frame Depth Estimation
- Authors: Jie Xiang, Yun Wang, Lifeng An, Haiyang Liu and Jian Liu
- Abstract summary: We propose a novel self-supervised training framework for single-frame and multi-frame depth estimation.
We first introduce a pixel-wise adaptive depth sampling module guided by single-frame depth to train the multi-frame model.
We then leverage the minimum reprojection based distillation loss to transfer the knowledge from the multi-frame depth network to the single-frame network.
- Score: 10.872396009088595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although both self-supervised single-frame and multi-frame depth estimation
methods only require unlabeled monocular videos for training, the information
they leverage varies because single-frame methods mainly rely on
appearance-based features while multi-frame methods focus on geometric cues.
Considering the complementary information of single-frame and multi-frame
methods, some works attempt to leverage single-frame depth to improve
multi-frame depth. However, these methods can neither exploit the difference
between single-frame depth and multi-frame depth to improve multi-frame depth
nor leverage multi-frame depth to optimize single-frame depth models. To fully
utilize the mutual influence between single-frame and multi-frame methods, we
propose a novel self-supervised training framework. Specifically, we first
introduce a pixel-wise adaptive depth sampling module guided by single-frame
depth to train the multi-frame model. Then, we leverage the minimum
reprojection based distillation loss to transfer the knowledge from the
multi-frame depth network to the single-frame network to improve single-frame
depth. Finally, we regard the improved single-frame depth as a prior to further
boost the performance of multi-frame depth estimation. Experimental results on
the KITTI and Cityscapes datasets show that our method outperforms existing
approaches in the self-supervised monocular setting.
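As a rough illustration of the two mechanisms the abstract describes, the PyTorch-style sketch below shows (a) per-pixel depth candidates centred on the single-frame prediction and (b) a distillation loss gated by the minimum reprojection error. The function names, candidate range, and log-L1 distance are illustrative assumptions, not the authors' implementation.

```python
import torch

def adaptive_depth_candidates(single_depth, num_bins=16, rel_range=0.4):
    # single_depth: (B, 1, H, W) prediction from the single-frame network.
    # Instead of one global depth range for the whole cost volume, sample
    # num_bins hypotheses per pixel inside
    # [(1 - rel_range) * d, (1 + rel_range) * d] (range is an assumed value).
    steps = torch.linspace(1.0 - rel_range, 1.0 + rel_range, num_bins,
                           device=single_depth.device,
                           dtype=single_depth.dtype)
    return single_depth * steps.view(1, -1, 1, 1)  # (B, num_bins, H, W)

def min_reproj_distillation(single_depth, multi_depth, err_single, err_multi):
    # err_single / err_multi: (B, 1, H, W) photometric reprojection errors
    # obtained by warping the source frame with each depth (computed
    # elsewhere, in the usual self-supervised pipeline).
    # Trust the multi-frame depth only where it reprojects better.
    mask = (err_multi < err_single).float()
    # Log-L1 keeps the loss scale-balanced between near and far pixels
    # (an assumed choice; the paper may use a different distance).
    diff = (single_depth.clamp(min=1e-6).log()
            - multi_depth.detach().clamp(min=1e-6).log()).abs()
    return (mask * diff).sum() / mask.sum().clamp(min=1.0)
```

During training, the candidates would populate the multi-frame cost volume, while the gated loss back-propagates only into the single-frame network (hence the detach on the multi-frame teacher).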
Related papers
- A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding [76.44979557843367]
We propose a novel multi-view stereo (MVS) framework that gets rid of the depth range prior.
We introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information.
We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image.
arXiv Detail & Related papers (2024-11-04T08:50:16Z)
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation [11.611045114232187]
Recent methods only conduct view synthesis between existing camera views, leading to insufficient guidance.
We synthesize additional virtual camera views via flow-based video frame interpolation (VFI).
For multi-frame inference, to sidestep the problem of dynamic objects encountered by explicit geometry-based methods like ManyDepth, we return to the feature fusion paradigm.
We construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth.
arXiv Detail & Related papers (2024-07-19T08:51:51Z)
- FusionDepth: Complement Self-Supervised Monocular Depth Estimation with Cost Volume [9.912304015239313]
We propose a multi-frame depth estimation framework in which monocular depth can be refined continuously by multi-frame sequential constraints.
Our method also enhances the interpretability when combining monocular estimation with multi-view cost volume.
arXiv Detail & Related papers (2023-05-10T10:38:38Z)
- Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks.
Experiments on real-world datasets demonstrate the effectiveness and generalization ability of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z)
- Multi-Frame Self-Supervised Depth Estimation with Multi-Scale Feature Fusion in Dynamic Scenes [25.712707161201802]
Multi-frame methods improve monocular depth estimation over single-frame approaches.
Recent methods tend to propose complex architectures for feature matching and dynamic scenes.
We show that a simple learning framework, together with designed feature augmentation, leads to superior performance.
arXiv Detail & Related papers (2023-03-26T05:26:30Z)
- Multi-Camera Collaborative Depth Prediction via Consistent Structure Estimation [75.99435808648784]
We propose a novel multi-camera collaborative depth prediction method.
It does not require large overlapping areas while maintaining structure consistency between cameras.
Experimental results on DDAD and NuScenes datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2022-10-05T03:44:34Z)
- Multi-Frame Self-Supervised Depth with Transformers [33.00363651105475]
We propose a novel transformer architecture for cost volume generation.
We use depth-discretized epipolar sampling to select matching candidates.
We refine predictions through a series of self- and cross-attention layers (a sampling sketch follows this entry).
arXiv Detail & Related papers (2022-04-15T19:04:57Z)
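As a hypothetical sketch of what the depth-discretized epipolar sampling above can look like in code: candidate depths are back-projected from each target pixel and re-projected into the source view, where features are gathered as matching candidates. The function name, the shared depth bins, and the bilinear gather via grid_sample are assumptions; the transformer-specific details are omitted.

```python
import torch
import torch.nn.functional as F

def epipolar_sample(src_feat, depth_bins, K, K_inv, R, t):
    # src_feat:   (B, C, H, W) source-view feature map.
    # depth_bins: (D,) candidate depths shared by all pixels (an assumed
    #             simplification; per-pixel bins work the same way).
    # K, K_inv:   (B, 3, 3) intrinsics; R, t: (B, 3, 3), (B, 3, 1)
    #             target-to-source relative pose.
    # Returns (B, C, D, H, W): one matching candidate per depth bin,
    # gathered along each pixel's epipolar line in the source view.
    B, C, H, W = src_feat.shape
    D = depth_bins.numel()
    dt, dev = src_feat.dtype, src_feat.device
    v, u = torch.meshgrid(torch.arange(H, device=dev, dtype=dt),
                          torch.arange(W, device=dev, dtype=dt),
                          indexing="ij")
    p = torch.stack([u, v, torch.ones_like(u)], 0).view(1, 3, -1)  # (1,3,HW)

    rays = K_inv @ p.expand(B, -1, -1)                       # (B, 3, HW)
    # Back-project at every candidate depth, then project into the source.
    pts = rays.unsqueeze(1) * depth_bins.view(1, D, 1, 1)    # (B, D, 3, HW)
    cam = (K @ R).unsqueeze(1) @ pts + (K @ t).unsqueeze(1)  # (B, D, 3, HW)
    uv = cam[:, :, :2] / cam[:, :, 2:3].clamp(min=1e-6)      # (B, D, 2, HW)

    # Normalise pixel coordinates to [-1, 1] for grid_sample.
    gx = 2.0 * uv[:, :, 0] / (W - 1) - 1.0
    gy = 2.0 * uv[:, :, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)
    out = F.grid_sample(src_feat, grid, align_corners=True)  # (B, C, D*H, W)
    return out.view(B, C, D, H, W)
```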
- Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework that exploits monocular depth estimation to improve visual odometry (VO).
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z)
- Video Depth Estimation by Fusing Flow-to-Depth Proposals [65.24533384679657]
We present an approach with a differentiable flow-to-depth layer for video depth estimation.
The model consists of a flow-to-depth layer, a camera pose refinement module, and a depth fusion network.
Our approach outperforms state-of-the-art depth estimation methods and has reasonable cross-dataset generalization capability (a minimal triangulation sketch follows this list).
arXiv Detail & Related papers (2019-12-30T10:45:57Z)
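To make the flow-to-depth idea in the last entry concrete, here is a minimal sketch that triangulates a per-pixel depth proposal from optical flow and a relative camera pose. The interface and the averaging of the two closed-form row estimates are assumptions; the paper's differentiable layer may differ in detail.

```python
import torch

def flow_to_depth(flow, K, K_inv, R, t):
    # flow: (B, 2, H, W) optical flow from the target to the source frame.
    # K, K_inv: (B, 3, 3) intrinsics; R, t: (B, 3, 3), (B, 3, 1) pose of
    # the target camera expressed in the source camera frame.
    B, _, H, W = flow.shape
    dt, dev = flow.dtype, flow.device
    v, u = torch.meshgrid(torch.arange(H, device=dev, dtype=dt),
                          torch.arange(W, device=dev, dtype=dt),
                          indexing="ij")
    p = torch.stack([u, v, torch.ones_like(u)], 0).view(1, 3, -1)  # (1,3,HW)

    # Correspondence model: d' * p' = d * a + b, with a = K R K^-1 p
    # and b = K t; d is the target depth we solve for.
    a = K @ R @ K_inv @ p.expand(B, -1, -1)           # (B, 3, HW)
    b = (K @ t).expand(-1, -1, H * W)                 # (B, 3, HW)

    # Matched source pixel given by the flow.
    u2 = u.reshape(1, -1) + flow[:, 0].reshape(B, -1)  # (B, HW)
    v2 = v.reshape(1, -1) + flow[:, 1].reshape(B, -1)

    # Eliminate d' with the z-row; the x- and y-rows each give a
    # closed-form depth estimate, which we simply average here.
    eps = 1e-8  # guards near-degenerate pixels close to the epipole
    dx = (u2 * b[:, 2] - b[:, 0]) / (a[:, 0] - u2 * a[:, 2] + eps)
    dy = (v2 * b[:, 2] - b[:, 1]) / (a[:, 1] - v2 * a[:, 2] + eps)
    return (0.5 * (dx + dy)).view(B, 1, H, W).clamp(min=0.0)
```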