Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are
Better Than One
- URL: http://arxiv.org/abs/2111.08313v2
- Date: Mon, 25 Sep 2023 14:29:20 GMT
- Title: Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are
Better Than One
- Authors: Shuwei Shao, Ran Li, Zhongcai Pei, Zhong Liu, Weihai Chen, Wentao Zhu,
Xingming Wu and Baochang Zhang
- Abstract summary: We propose to integrate the strengths of multiple weak depth predictors to build a comprehensive and accurate depth predictor.
Specifically, we construct multiple base (weak) depth predictors by utilizing different Transformer-based and convolutional neural network (CNN)-based architectures.
The resultant model, which we refer to as Transformer-assisted depth ensembles (TEDepth), achieves better results than previous state-of-the-art approaches.
- Score: 32.01675089157679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Depth estimation attracts widespread attention in the computer vision
community. However, it is still quite difficult to recover an accurate depth
map using only one RGB image. We observe that existing methods tend to fail
in different cases, owing to differences in network architecture, loss
function, and other factors. In this work, we investigate this phenomenon and
propose to integrate the strengths of multiple weak depth predictors to build a
comprehensive and accurate depth predictor, which is critical for many
real-world applications, e.g., 3D reconstruction. Specifically, we construct
multiple base (weak) depth predictors by utilizing different Transformer-based
and convolutional neural network (CNN)-based architectures. The Transformer
establishes long-range correlation, while the CNN, with its spatial inductive
bias, preserves the local information that the Transformer overlooks. Thus, the
coupling of Transformer and CNN contributes to the generation of complementary
depth estimates, which are essential to achieve a comprehensive depth
predictor. Then, we design mixers to learn from multiple weak predictions and
adaptively fuse them into a strong depth estimate. We refer to the resultant
model as Transformer-assisted depth ensembles (TEDepth). On the standard
NYU-Depth-v2 and KITTI datasets, we thoroughly explore how neural ensembles
affect depth estimation and demonstrate that our TEDepth achieves better
results than previous state-of-the-art approaches. To validate the
generalizability across cameras, we directly apply the models trained on
NYU-Depth-v2 to the SUN RGB-D dataset without any fine-tuning, and the superior
results confirm its strong generalizability.
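As a concrete illustration of the mixer idea described in the abstract, the sketch below fuses several candidate depth maps with per-pixel convex weights. This is a minimal reading of the abstract, not the paper's actual architecture: the module name DepthMixer, its layer sizes, and the softmax weighting scheme are all illustrative assumptions.
```python
import torch
import torch.nn as nn

class DepthMixer(nn.Module):
    """Fuse depth maps from several weak predictors into one estimate.

    Sketch of the 'mixer' idea from TEDepth; the paper's exact design may
    differ. The per-pixel softmax weighting here is an assumption.
    """

    def __init__(self, num_predictors: int, hidden: int = 32):
        super().__init__()
        # Small CNN that maps the stacked candidate depths to per-pixel weights.
        self.weight_net = nn.Sequential(
            nn.Conv2d(num_predictors, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_predictors, 3, padding=1),
        )

    def forward(self, depths: list) -> torch.Tensor:
        # depths: list of [B, 1, H, W] maps from CNN- and Transformer-based heads.
        stacked = torch.cat(depths, dim=1)                   # [B, K, H, W]
        weights = self.weight_net(stacked).softmax(dim=1)    # convex weights per pixel
        return (weights * stacked).sum(dim=1, keepdim=True)  # [B, 1, H, W]

# Usage: fuse three weak predictions (e.g., from different backbone decoders).
mixer = DepthMixer(num_predictors=3)
preds = [torch.rand(2, 1, 240, 320) for _ in range(3)]
fused = mixer(preds)  # [2, 1, 240, 320]
```
Under this reading, each weak head contributes where it is locally reliable, which matches the abstract's claim that Transformer and CNN estimates are complementary.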
Related papers
- Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN [9.185929396989083]
We employ a sparse pixel approach to contrastively analyze the distinctions between Transformers and CNNs.
Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity.
We propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration.
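To make the "high-order differentiation" ingredient concrete, one can penalize mismatches in second-order depth differences, as sketched below. The DGR module itself is learned, so this hand-written loss is only an assumed analogue, not the authors' implementation.
```python
import torch

def second_order_grad_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Penalize mismatch in second-order depth differences (gradient continuity).

    Illustrative only: DGR is a learned module; this loss merely shows what
    high-order differentiation of a depth map looks like. Inputs: [B, 1, H, W].
    """
    def d2(x: torch.Tensor):
        # Second-order finite differences along height and width.
        dyy = x[:, :, 2:, :] - 2 * x[:, :, 1:-1, :] + x[:, :, :-2, :]
        dxx = x[:, :, :, 2:] - 2 * x[:, :, :, 1:-1] + x[:, :, :, :-2]
        return dyy, dxx

    (pyy, pxx), (gyy, gxx) = d2(pred), d2(gt)
    return (pyy - gyy).abs().mean() + (pxx - gxx).abs().mean()
```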
arXiv Detail & Related papers (2023-08-16T12:46:52Z)
- CompletionFormer: Depth Completion with Convolutions and Vision Transformers [0.0]
This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
Our CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods.
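A loose sketch of a block that couples a convolutional branch with self-attention, in the spirit of JCAT, follows; the real CompletionFormer block is more elaborate, and the layer choices here are assumptions.
```python
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    """Couple a convolutional branch with self-attention in one block.

    Loose sketch of the JCAT idea (convolution + Vision Transformer in a
    single unit); not the actual CompletionFormer design.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(  # local branch: preserves fine detail
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.conv(x)                              # convolutional branch
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # [B, H*W, C]
        glob, _ = self.attn(tokens, tokens, tokens)       # global self-attention
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return x + local + glob                           # residual fusion of both branches

# Usage: dim must be divisible by the number of heads.
block = ConvAttnBlock(dim=32)
y = block(torch.rand(1, 32, 30, 40))  # [1, 32, 30, 40]
```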
arXiv Detail & Related papers (2023-04-25T17:59:47Z)
- URCDC-Depth: Uncertainty Rectified Cross-Distillation with CutFlip for Monocular Depth Estimation [24.03121823263355]
We introduce an uncertainty rectified cross-distillation between Transformer and convolutional neural network (CNN) to learn a unified depth estimator.
Specifically, we use the depth estimates from the Transformer branch and the CNN branch as pseudo labels to teach each other.
We propose a surprisingly simple yet highly effective data augmentation technique, CutFlip, which forces the model to exploit more valuable clues apart from the vertical image position for depth inference.
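The summary does not spell out the cut rule, so the sketch below assumes a random horizontal cut with the two parts swapped, applied identically to image and depth; the official CutFlip implementation may choose the cut position differently.
```python
import torch

def cutflip(image: torch.Tensor, depth: torch.Tensor, p: float = 0.5):
    """CutFlip-style augmentation: cut along a random horizontal line and
    swap the upper and lower parts of both the image and its depth map.

    Sketch based on the abstract's description; details are assumptions.
    Works for [C, H, W] or [B, C, H, W] tensors.
    """
    if torch.rand(1).item() > p:
        return image, depth
    h = image.shape[-2]
    # Avoid degenerate cuts too close to either border.
    cut = int(torch.randint(int(0.2 * h), int(0.8 * h), (1,)).item())
    image = torch.cat([image[..., cut:, :], image[..., :cut, :]], dim=-2)
    depth = torch.cat([depth[..., cut:, :], depth[..., :cut, :]], dim=-2)
    return image, depth
```
Swapping the halves breaks the usual "lower pixels are closer" prior, so the model must rely on cues other than vertical position.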
arXiv Detail & Related papers (2023-02-16T08:53:08Z)
- SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network [29.798579906253696]
It is challenging to acquire dense ground truth depth labels for supervised training, and the unsupervised depth estimation using monocular sequences emerges as a promising alternative.
In this paper, we employ a convolution-free Swin Transformer as an image feature extractor so that the network can capture both local geometric features and global semantic features for depth estimation.
Also, we propose a Densely Cascaded Multi-scale Network (DCMNet) that connects every feature map directly with another from different scales via a top-down cascade pathway.
arXiv Detail & Related papers (2023-01-17T06:01:46Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- 3DVNet: Multi-View Depth Prediction and Volumetric Refinement [68.68537312256144]
3DVNet is a novel multi-view stereo (MVS) depth-prediction method.
Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions.
We show that our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics.
arXiv Detail & Related papers (2021-12-01T00:52:42Z)
- VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction [71.83308989022635]
In this paper, we advocate that replicating the traditional two-stage framework with deep neural networks improves both the interpretability and the accuracy of the results.
Our network operates in two steps: 1) the local computation of depth maps with a deep MVS technique, and 2) the fusion of the depth maps and image features to build a single TSDF volume.
In order to improve the matching performance between images acquired from very different viewpoints, we introduce a rotation-invariant 3D convolution kernel called PosedConv.
arXiv Detail & Related papers (2021-08-19T11:33:58Z)
- Dilated Fully Convolutional Neural Network for Depth Estimation from a Single Image [1.0131895986034314]
We present an advanced Dilated Fully Convolutional Neural Network to address the deficiencies of traditional CNNs.
Taking advantage of the exponential expansion of the receptive field in dilated convolutions, our model can minimize the loss of resolution.
We show experimentally on the NYU Depth V2 dataset that the depth prediction obtained from our model is considerably closer to ground truth than that from traditional CNN techniques.
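The receptive-field expansion the entry refers to is easy to see in a small stack of dilated convolutions; the channel widths and dilation rates below are illustrative, not the paper's.
```python
import torch
import torch.nn as nn

# Stacking dilated 3x3 convolutions grows the receptive field
# (here 3 -> 7 -> 15 pixels) without pooling, so spatial resolution
# is preserved end to end. With kernel 3, padding == dilation keeps
# the output the same size as the input.
dilated_stack = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1, dilation=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, 3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 1, 3, padding=4, dilation=4),  # 1-channel depth head
)
out = dilated_stack(torch.rand(1, 3, 240, 320))  # [1, 1, 240, 320]
```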
arXiv Detail & Related papers (2021-03-12T23:19:32Z)
- PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss [49.66736599668501]
We propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net.
Our method shows unprecedented accuracy levels, exceeding 95% in terms of the $\delta_1$ metric on the KITTI dataset.
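For reference, the $\delta_1$ metric cited here is the standard threshold accuracy used across these depth benchmarks; a minimal implementation:
```python
import torch

def delta1(pred: torch.Tensor, gt: torch.Tensor, thresh: float = 1.25) -> float:
    """delta_1 accuracy: fraction of pixels whose predicted depth is
    within a factor of 1.25 of the ground truth."""
    valid = gt > 0  # ignore pixels without ground-truth depth
    ratio = torch.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return (ratio < thresh).float().mean().item()
```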
arXiv Detail & Related papers (2021-03-12T15:54:46Z)
- Virtual Normal: Enforcing Geometric Constraints for Accurate and Robust Depth Prediction [87.08227378010874]
We show the importance of the high-order 3D geometric constraints for depth prediction.
By designing a loss term that enforces a simple geometric constraint, we significantly improve the accuracy and robustness of monocular depth estimation.
We show state-of-the-art results of learning metric depth on NYU Depth-V2 and KITTI.
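The sketch below illustrates the virtual-normal idea: compare normals of planes spanned by random point triplets in the predicted and ground-truth 3D point clouds (back-projected from depth with camera intrinsics, assumed given). The paper additionally filters degenerate, near-collinear triplets, which is omitted here.
```python
import torch

def virtual_normal_loss(p3d_pred: torch.Tensor, p3d_gt: torch.Tensor,
                        num_triplets: int = 1000) -> torch.Tensor:
    """Virtual-normal-style loss over [N, 3] point clouds.

    Sketch only: degenerate-triplet filtering from the paper is omitted.
    The same triplet indices are used for prediction and ground truth.
    """
    idx = torch.randint(0, p3d_pred.shape[0], (num_triplets, 3))

    def normals(points: torch.Tensor) -> torch.Tensor:
        a, b, c = points[idx[:, 0]], points[idx[:, 1]], points[idx[:, 2]]
        n = torch.cross(b - a, c - a, dim=1)             # plane normal per triplet
        return n / (n.norm(dim=1, keepdim=True) + 1e-8)  # unit normals

    return (normals(p3d_pred) - normals(p3d_gt)).abs().mean()
```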
arXiv Detail & Related papers (2021-03-07T00:08:21Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)