Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks
- URL: http://arxiv.org/abs/2011.13118v3
- Date: Mon, 12 Jul 2021 16:02:54 GMT
- Title: Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks
- Authors: Xiaoxiao Long, Lingjie Liu, Wei Li, Christian Theobalt, Wenping Wang
- Abstract summary: We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
- Score: 87.50632573601283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel method for multi-view depth estimation from a single
video, which is a critical task in various applications, such as perception,
reconstruction and robot navigation. Although previous learning-based methods
have demonstrated compelling results, most works estimate depth maps of
individual video frames independently, without taking into consideration the
strong geometric and temporal coherence among the frames. Moreover, current
state-of-the-art (SOTA) models mostly adopt a fully 3D convolution network for
cost regularization and therefore require high computational cost, thus
limiting their deployment in real-world applications. Our method achieves
temporally coherent depth estimation results by using a novel Epipolar
Spatio-Temporal (EST) transformer to explicitly associate geometric and
temporal correlation with multiple estimated depth maps. Furthermore, to reduce
the computational cost, inspired by recent Mixture-of-Experts models, we design
a compact hybrid network consisting of a 2D context-aware network and a 3D
matching network which learn 2D context information and 3D disparity cues
separately. Extensive experiments demonstrate that our method achieves higher
accuracy in depth estimation and significant speedup than the SOTA methods.
Related papers
- SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via
Swin Transformer and Densely Cascaded Network [29.798579906253696]
It is challenging to acquire dense ground truth depth labels for supervised training, and the unsupervised depth estimation using monocular sequences emerges as a promising alternative.
In this paper, we employ a convolution-free Swin Transformer as an image feature extractor so that the network can capture both local geometric features and global semantic features for depth estimation.
Also, we propose a Densely Cascaded Multi-scale Network (DCMNet) that connects every feature map directly with another from different scales via a top-down cascade pathway.
arXiv Detail & Related papers (2023-01-17T06:01:46Z) - GraphCSPN: Geometry-Aware Depth Completion via Dynamic GCNs [49.55919802779889]
We propose a Graph Convolution based Spatial Propagation Network (GraphCSPN) as a general approach for depth completion.
In this work, we leverage convolution neural networks as well as graph neural networks in a complementary way for geometric representation learning.
Our method achieves the state-of-the-art performance, especially when compared in the case of using only a few propagation steps.
arXiv Detail & Related papers (2022-10-19T17:56:03Z) - 3DVNet: Multi-View Depth Prediction and Volumetric Refinement [68.68537312256144]
3DVNet is a novel multi-view stereo (MVS) depth-prediction method.
Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions.
We show that our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics.
arXiv Detail & Related papers (2021-12-01T00:52:42Z) - Non-local Recurrent Regularization Networks for Multi-view Stereo [108.17325696835542]
In deep multi-view stereo networks, cost regularization is crucial to achieve accurate depth estimation.
We propose a novel non-local recurrent regularization network for multi-view stereo, named NR2-Net.
Our method achieves state-of-the-art reconstruction results on both DTU and Tanks and Temples datasets.
arXiv Detail & Related papers (2021-10-13T01:43:54Z) - VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction [71.83308989022635]
In this paper, we advocate that replicating the traditional two stages framework with deep neural networks improves both the interpretability and the accuracy of the results.
Our network operates in two steps: 1) the local computation of the local depth maps with a deep MVS technique, and, 2) the depth maps and images' features fusion to build a single TSDF volume.
In order to improve the matching performance between images acquired from very different viewpoints, we introduce a rotation-invariant 3D convolution kernel called PosedConv.
arXiv Detail & Related papers (2021-08-19T11:33:58Z) - Self-supervised Depth Estimation Leveraging Global Perception and
Geometric Smoothness Using On-board Videos [0.5276232626689566]
We present DLNet for pixel-wise depth estimation, which simultaneously extracts global and local features.
A three-dimensional geometry smoothness loss is proposed to predict a geometrically natural depth map.
In experiments on the KITTI and Make3D benchmarks, the proposed DLNet achieves performance competitive to those of the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-07T10:53:27Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for
Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z) - DELTAS: Depth Estimation by Learning Triangulation And densification of
Sparse points [14.254472131009653]
Multi-view stereo (MVS) is the golden mean between the accuracy of active depth sensing and the practicality of monocular depth estimation.
Cost volume based approaches employing 3D convolutional neural networks (CNNs) have considerably improved the accuracy of MVS systems.
We propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs.
arXiv Detail & Related papers (2020-03-19T17:56:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.