URCDC-Depth: Uncertainty Rectified Cross-Distillation with CutFlip for Monocular Depth Estimation
- URL: http://arxiv.org/abs/2302.08149v2
- Date: Fri, 17 Feb 2023 04:20:14 GMT
- Title: URCDC-Depth: Uncertainty Rectified Cross-Distillation with CutFlip for Monocular Depth Estimation
- Authors: Shuwei Shao, Zhongcai Pei, Weihai Chen, Ran Li, Zhong Liu and Zhengguo Li
- Abstract summary: We introduce an uncertainty rectified cross-distillation between Transformer and convolutional neural network (CNN) to learn a unified depth estimator.
Specifically, we use the depth estimates from the Transformer branch and the CNN branch as pseudo labels to teach each other.
We propose a surprisingly simple yet highly effective data augmentation technique, CutFlip, which forces the model to exploit depth clues beyond the vertical image position.
- Score: 24.03121823263355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work aims to estimate a high-quality depth map from a single RGB image.
Due to the lack of depth clues, making full use of the long-range correlation
and the local information is critical for accurate depth estimation. Towards
this end, we introduce an uncertainty rectified cross-distillation between
Transformer and convolutional neural network (CNN) to learn a unified depth
estimator. Specifically, we use the depth estimates from the Transformer branch
and the CNN branch as pseudo labels to teach each other. Meanwhile, we model
the pixel-wise depth uncertainty to rectify the loss weights of noisy pseudo
labels. To avoid the large capacity gap induced by the strong Transformer
branch deteriorating the cross-distillation, we transfer the feature maps from
Transformer to CNN and design coupling units to assist the weak CNN branch to
leverage the transferred features. Furthermore, we propose a surprisingly
simple yet highly effective data augmentation technique, CutFlip, which forces
the model to exploit depth clues beyond the vertical image position. Extensive
experiments demonstrate that our model, termed URCDC-Depth, exceeds previous
state-of-the-art methods on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets,
while incurring no additional computational burden at inference time. The
source code is publicly available at
https://github.com/ShuweiShao/URCDC-Depth.
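As a concrete illustration of the cross-distillation idea, below is a minimal PyTorch-style sketch of an uncertainty-rectified mutual distillation loss. The tensor names, the exp(-uncertainty) weighting, and the L1 penalty are illustrative assumptions rather than the paper's exact formulation; the repository linked above contains the authors' implementation.

```python
import torch

def cross_distillation_loss(depth_t, depth_c, sigma_t, sigma_c):
    """Uncertainty-rectified cross-distillation (minimal sketch).

    depth_t, depth_c: depth maps from the Transformer and CNN branches,
        shape (B, 1, H, W).
    sigma_t, sigma_c: pixel-wise uncertainty maps for the respective
        branches (larger = less reliable), same shape.

    Each branch's prediction serves as a pseudo label for the other;
    the pseudo label's uncertainty down-weights the per-pixel loss.
    The exp(-sigma) weighting is an assumption, not the paper's exact rule.
    """
    # CNN branch learns from the Transformer's (detached) pseudo labels,
    # with unreliable pixels down-weighted by the teacher's uncertainty.
    w_t = torch.exp(-sigma_t.detach())
    loss_c = (w_t * torch.abs(depth_c - depth_t.detach())).mean()

    # Transformer branch learns from the CNN's pseudo labels, symmetrically.
    w_c = torch.exp(-sigma_c.detach())
    loss_t = (w_c * torch.abs(depth_t - depth_c.detach())).mean()

    return loss_c + loss_t
```

Detaching each pseudo label stops gradients from flowing into the teacher branch, so each network is supervised by, but never perturbed through, the other's prediction.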
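CutFlip can likewise be sketched in a few lines. The version below assumes it swaps the image regions above and below a randomly chosen horizontal cut line and applies the identical swap to the ground-truth depth; the probability and cut-range bounds are illustrative defaults, not the paper's settings.

```python
import random
import torch

def cutflip(image, depth, p=0.5):
    """CutFlip data augmentation (minimal sketch of the assumed behavior).

    image: (C, H, W) tensor; depth: (1, H, W) ground-truth depth tensor.
    With probability p, swap the regions above and below a random
    horizontal cut line in both the image and the depth map.
    """
    if random.random() > p:
        return image, depth
    h = image.shape[-2]
    # Keep the cut away from the borders so both parts stay non-trivial
    # (the 0.2/0.8 bounds are illustrative, not the paper's values).
    cut = random.randint(int(0.2 * h), int(0.8 * h))
    image = torch.cat([image[..., cut:, :], image[..., :cut, :]], dim=-2)
    depth = torch.cat([depth[..., cut:, :], depth[..., :cut, :]], dim=-2)
    return image, depth
```

Because every pixel keeps its depth value but changes its vertical position, a model trained with this augmentation can no longer rely on image height alone as a depth prior.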
Related papers
- Self-supervised Monocular Depth Estimation with Large Kernel Attention [30.44895226042849]
We propose a self-supervised monocular depth estimation network to get finer details.
Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies.
Our method achieves competitive results on the KITTI dataset.
arXiv Detail & Related papers (2024-09-26T14:44:41Z)
- SDformer: Efficient End-to-End Transformer for Depth Completion [5.864200786548098]
Depth completion aims to predict dense depth maps with sparse depth measurements from a depth sensor.
Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks.
To overcome the drawbacks of CNNs, the paper presents a more effective and powerful sequence-to-sequence model built on adaptive self-attention.
arXiv Detail & Related papers (2024-09-12T15:52:08Z)
- AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation [51.143540967290114]
We propose a method that unlocks a wide range of previously-infeasible geometric augmentations for unsupervised depth computation and estimation.
This is achieved by reversing, or "undo"-ing, the geometric transformations applied to the coordinates of the output depth, warping the depth map back to the original reference frame.
arXiv Detail & Related papers (2023-10-15T05:15:45Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Single Image Depth Prediction Made Better: A Multivariate Gaussian Take [163.14849753700682]
We introduce an approach that performs continuous modeling of per-pixel depth.
Our method, named MG, ranks among the top entries on the KITTI depth-prediction benchmark leaderboard.
arXiv Detail & Related papers (2023-03-31T16:01:03Z)
- Unsupervised Spike Depth Estimation via Cross-modality Cross-domain Knowledge Transfer [53.413305467674434]
We introduce open-source RGB data to support spike depth estimation, leveraging its annotations and spatial information.
We propose a cross-modality cross-domain (BiCross) framework to realize unsupervised spike depth estimation.
Our method achieves state-of-the-art (SOTA) performance compared with RGB-oriented unsupervised depth estimation methods.
arXiv Detail & Related papers (2022-08-26T09:35:20Z)
- Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on an indoor NYUV2 dataset and an outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer for monocular depth estimation.
Our proposed method improves the state of the art by 3.3% on NYUV2 and 3.3% on KITTI in terms of Root Mean Squared Error (RMSE).
arXiv Detail & Related papers (2022-07-10T20:49:11Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Towards Comprehensive Monocular Depth Estimation: Multiple Heads Are Better Than One [32.01675089157679]
We propose to integrate the strengths of multiple weak depth predictors to build a comprehensive and accurate depth predictor.
Specifically, we construct multiple base (weak) depth predictors by utilizing different Transformer-based and convolutional neural network (CNN)-based architectures.
The resultant model, which we refer to as Transformer-assisted depth ensembles (TEDepth), achieves better results than previous state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-16T09:09:05Z)
- PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss [49.66736599668501]
We propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net.
Our method shows unprecedented accuracy levels, exceeding 95% in terms of the $\delta_1$ metric on the KITTI dataset.
arXiv Detail & Related papers (2021-03-12T15:54:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.