Multi-Frame Self-Supervised Depth Estimation with Multi-Scale Feature Fusion in Dynamic Scenes
- URL: http://arxiv.org/abs/2303.14628v2
- Date: Tue, 19 Dec 2023 05:28:13 GMT
- Title: Multi-Frame Self-Supervised Depth Estimation with Multi-Scale Feature Fusion in Dynamic Scenes
- Authors: Jiquan Zhong, Xiaolin Huang, Xiao Yu
- Abstract summary: Multi-frame methods improve monocular depth estimation over single-frame approaches.
Recent methods tend to propose complex architectures for feature matching and dynamic scenes.
We show that a simple learning framework, together with designed feature augmentation, leads to superior performance.
- Score: 25.712707161201802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-frame methods improve monocular depth estimation over single-frame approaches by aggregating spatial-temporal information via feature matching. However, spatial-temporal features degrade accuracy in dynamic scenes. To improve performance, recent methods tend to propose complex architectures for feature matching and dynamic scenes. In this paper, we show that a simple learning framework, together with designed feature augmentation, leads to superior performance. (1) A novel dynamic-object detection method with geometric explainability is proposed. The detected dynamic objects are excluded during training, which preserves the static-environment assumption and relieves the accuracy degradation of multi-frame depth estimation. (2) Multi-scale feature fusion is proposed for feature matching in the multi-frame depth network, which improves matching, especially between frames with large camera motion. (3) Robust knowledge distillation, with a robust teacher network and a reliability guarantee, is proposed; it improves multi-frame depth estimation without increasing computational complexity at test time. Experiments show that the proposed methods achieve substantial performance improvements in multi-frame depth estimation.
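As a concrete illustration of point (1), below is a minimal PyTorch-style sketch of excluding detected dynamic objects from the self-supervised photometric loss. The function names, the plain L1 error, and the normalization scheme are our own simplifying assumptions, not the paper's published implementation.

```python
import torch

def photometric_error(pred, target):
    """Per-pixel L1 error; a full implementation would blend in SSIM."""
    return (pred - target).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)

def masked_reprojection_loss(warped, target, dynamic_mask):
    """Exclude pixels flagged as dynamic from photometric supervision.

    warped:       source frame warped into the target view, (B, 3, H, W)
    target:       target frame, (B, 3, H, W)
    dynamic_mask: 1.0 where a pixel belongs to a detected dynamic object, (B, 1, H, W)
    """
    err = photometric_error(warped, target)
    static = 1.0 - dynamic_mask                  # keep only static pixels
    # Normalize by the static-pixel count so the loss scale stays stable.
    return (err * static).sum() / static.sum().clamp(min=1.0)
```

Dropping dynamic pixels this way keeps the static-environment assumption intact without requiring any change to the depth network itself.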
Related papers
- Manydepth2: Motion-Aware Self-Supervised Multi-Frame Monocular Depth Estimation in Dynamic Scenes [45.092076587934464]
We present Manydepth2, which achieves precise depth estimation for both dynamic objects and static backgrounds.
To tackle the challenges posed by dynamic content, we incorporate optical flow and coarse monocular depth to create a pseudo-static reference frame.
This frame is then used, together with the vanilla target frame, to build a motion-aware cost volume (see the cost-volume sketch after this entry).
arXiv Detail & Related papers (2023-12-23T14:36:27Z)
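The motion-aware cost volume builds on the standard plane-sweep construction. The sketch below shows a plain plane-sweep L1 cost volume under our own assumptions (4x4 relative pose, L1 matching cost); in Manydepth2's pipeline the pseudo-static reference frame would stand in for `feat_src`.

```python
import torch
import torch.nn.functional as F

def plane_sweep_cost_volume(feat_tgt, feat_src, K, K_inv, T, depth_bins):
    """L1 cost volume over depth hypotheses via plane-sweep warping.

    feat_tgt, feat_src: (B, C, H, W) features of the target / source frames
    K, K_inv:           (B, 3, 3) camera intrinsics and their inverse
    T:                  (B, 4, 4) relative pose, target -> source
    depth_bins:         (D,) candidate depths
    """
    B, C, H, W = feat_tgt.shape
    device = feat_tgt.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(B, 3, -1)
    rays = K_inv @ pix                                   # back-projection rays, (B, 3, HW)
    costs = []
    for d in depth_bins:
        pts = rays * d                                   # 3D points at hypothesis depth d
        pts_h = torch.cat([pts, torch.ones(B, 1, H * W, device=device)], 1)
        cam = (T @ pts_h)[:, :3]                         # transform into the source camera
        uv = K @ cam
        uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)       # perspective divide
        u = uv[:, 0] / (W - 1) * 2 - 1                   # normalize to [-1, 1]
        v = uv[:, 1] / (H - 1) * 2 - 1
        grid = torch.stack([u, v], -1).view(B, H, W, 2)
        warped = F.grid_sample(feat_src, grid, align_corners=True)
        costs.append((feat_tgt - warped).abs().mean(1))  # per-pixel L1 cost, (B, H, W)
    return torch.stack(costs, 1)                         # (B, D, H, W)
```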
- Learning Monocular Depth in Dynamic Environment via Context-aware Temporal Attention [9.837958401514141]
We present CTA-Depth, a Context-aware Temporal Attention guided network for multi-frame monocular Depth estimation.
Our approach achieves significant improvements over state-of-the-art approaches on three benchmark datasets.
arXiv Detail & Related papers (2023-05-12T11:48:32Z)
- Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes [51.20150148066458]
We propose a novel method that learns to fuse multi-view and monocular cues encoded as volumes without relying on hand-crafted masks (a fusion sketch follows this entry).
Experiments on real-world datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-04-18T13:55:24Z)
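In its simplest form, fusing the two cue volumes without hand-crafted masks could mean letting a small conv head predict a per-pixel blending weight. The module name, layer sizes, and sigmoid-weighted blend below are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class VolumeFusion(nn.Module):
    """Learned per-pixel fusion of a monocular and a multi-view depth volume.

    Both volumes are (B, D, H, W) scores over the same D depth bins.
    A small conv head predicts a blending weight instead of a fixed mask.
    """
    def __init__(self, n_bins):
        super().__init__()
        self.weight_head = nn.Sequential(
            nn.Conv2d(2 * n_bins, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, vol_mono, vol_multi):
        w = self.weight_head(torch.cat([vol_mono, vol_multi], dim=1))  # (B, 1, H, W)
        return w * vol_multi + (1 - w) * vol_mono   # lean on multi-view where reliable
```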
- Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the use of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed (a memory-bank sketch follows this entry).
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
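A generic version of the memory-bank interaction described above might look like the following; the fixed-length FIFO storage and scaled dot-product read-out are our own assumptions, and the paper's actual design (including the contrastive enhancement) is more involved.

```python
import torch
import torch.nn.functional as F
from collections import deque

class FeatureMemoryBank:
    """FIFO bank of per-frame features with attention-style read-out."""

    def __init__(self, max_frames=8):
        self.bank = deque(maxlen=max_frames)   # each entry: (N, C) frame features

    def write(self, feats):
        self.bank.append(feats.detach())       # store history without gradients

    def read(self, query):
        """Current-frame features (N, C) attend to the stored history."""
        if not self.bank:
            return query
        mem = torch.cat(list(self.bank), dim=0)                    # (M, C)
        attn = F.softmax(query @ mem.t() / mem.shape[1] ** 0.5, dim=-1)
        return query + attn @ mem                                  # residual read-out
```

Because only the current frame is encoded per timestamp, the expensive multi-frame context comes entirely from the cheap memory read.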
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-DepthFormer framework, which jointly predicts scene depth and a 3D motion field.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers to obtain enhanced depth feature representations (a cross-attention sketch follows this entry).
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic priors.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
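The multi-view correlation step can be sketched with a single standard cross-attention layer, as below; the dimensions and module names are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """One cross-attention layer: target-frame tokens attend to source-frame tokens."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tgt_tokens, src_tokens):
        # tgt_tokens: (B, N, C) target-frame features; src_tokens: (B, M, C)
        out, _ = self.attn(query=tgt_tokens, key=src_tokens, value=src_tokens)
        return self.norm(tgt_tokens + out)     # residual + norm, transformer-style
```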
- Multi-Camera Collaborative Depth Prediction via Consistent Structure Estimation [75.99435808648784]
We propose a novel multi-camera collaborative depth prediction method.
It does not require large overlapping areas while maintaining structure consistency between cameras.
Experimental results on DDAD and NuScenes datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2022-10-05T03:44:34Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted attention from the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- Rethinking Pareto Frontier for Performance Evaluation of Deep Neural Networks [2.167843405313757]
We re-define the efficiency measure using multi-objective optimization.
Competing variables are combined simultaneously into a single relative efficiency measure.
This makes it possible to rank deep models that run efficiently on different computing hardware, and objectively combines inference efficiency with training efficiency (a toy relative-efficiency sketch follows this entry).
arXiv Detail & Related papers (2022-02-18T15:58:17Z)
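As a toy illustration of ranking models by one relative efficiency measure over competing objectives: the weighted accuracy-to-cost ratio below is purely our assumption, not the paper's formulation.

```python
# Toy relative efficiency: accuracy is the output; latency and energy are the costs.
def relative_efficiency(accuracy, latency_ms, energy_j, w_lat=0.5, w_en=0.5):
    cost = w_lat * latency_ms + w_en * energy_j
    return accuracy / cost

models = {
    "model_a": dict(accuracy=0.76, latency_ms=12.0, energy_j=3.0),
    "model_b": dict(accuracy=0.74, latency_ms=6.0, energy_j=2.0),
}
# Rank by efficiency rather than accuracy alone; model_b wins despite lower accuracy.
ranked = sorted(models, key=lambda m: relative_efficiency(**models[m]), reverse=True)
print(ranked)
```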
- Robust Consistent Video Depth Estimation [65.53308117778361]
We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video.
Our algorithm combines two complementary techniques: (1) flexible deformation splines for low-frequency, large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details (a minimal spline-alignment sketch follows this entry).
In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures with significant noise, shake, motion blur, and rolling-shutter deformation.
arXiv Detail & Related papers (2020-12-10T18:59:48Z)
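A minimal stand-in for the low-frequency alignment idea: fit a smooth per-frame scale curve through a few temporal knots and apply it to each depth map. The paper's deformation splines are spatially varying; the linear-interpolation curve and all names here are simplifying assumptions.

```python
import numpy as np

def spline_scale_alignment(depths, target_scales, n_knots=5):
    """Low-frequency temporal alignment of a depth sequence.

    depths:        list of (H, W) depth maps over time
    target_scales: noisy per-frame scale estimates, shape (T,)
    Linear interpolation through a few knots stands in for a true spline.
    """
    T = len(depths)
    knot_t = np.linspace(0, T - 1, n_knots)
    # Sample the noisy per-frame scales at the knots, then interpolate smoothly.
    knot_s = np.interp(knot_t, np.arange(T), target_scales)
    smooth = np.interp(np.arange(T), knot_t, knot_s)
    return [d * s for d, s in zip(depths, smooth)]
```

The small knot count is what makes the correction low-frequency: per-frame jitter in the scale estimates cannot pass through it.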
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues [24.743099160992937]
We propose a novel self-supervised joint learning framework for depth estimation.
The proposed framework outperforms the state of the art (SOTA) on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2020-06-17T13:56:59Z)