DepthFormer: Exploiting Long-Range Correlation and Local Information for
Accurate Monocular Depth Estimation
- URL: http://arxiv.org/abs/2203.14211v1
- Date: Sun, 27 Mar 2022 05:03:56 GMT
- Title: DepthFormer: Exploiting Long-Range Correlation and Local Information for
Accurate Monocular Depth Estimation
- Authors: Zhenyu Li, Zehui Chen, Xianming Liu, Junjun Jiang
- Abstract summary: Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
- Score: 50.08080424613603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper aims to address the problem of supervised monocular depth
estimation. We start with a meticulous pilot study to demonstrate that the
long-range correlation is essential for accurate depth estimation. Therefore,
we propose to leverage the Transformer to model this global context with an
effective attention mechanism. We also adopt an additional convolution branch
to preserve the local information as the Transformer lacks the spatial
inductive bias in modeling such contents. However, independent branches lead to
a shortage of connections between features. To bridge this gap, we design a
hierarchical aggregation and heterogeneous interaction module to enhance the
Transformer features via element-wise interaction and model the affinity
between the Transformer and the CNN features in a set-to-set translation
manner. Due to the unbearable memory cost caused by global attention on
high-resolution feature maps, we introduce the deformable scheme to reduce the
complexity. Extensive experiments on the KITTI, NYU, and SUN RGB-D datasets
demonstrate that our proposed model, termed DepthFormer, surpasses
state-of-the-art monocular depth estimation methods with prominent margins.
Notably, it achieves the most competitive result on the highly competitive
KITTI depth estimation benchmark. Our codes and models are available at
https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox.
Related papers
- Self-supervised Monocular Depth Estimation with Large Kernel Attention [30.44895226042849]
We propose a self-supervised monocular depth estimation network to get finer details.
Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies.
Our method achieves competitive results on the KITTI dataset.
arXiv Detail & Related papers (2024-09-26T14:44:41Z) - Point-aware Interaction and CNN-induced Refinement Network for RGB-D
Salient Object Detection [95.84616822805664]
We introduce CNNs-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement.
In order to alleviate the block effect and detail destruction problems brought by the Transformer naturally, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation.
arXiv Detail & Related papers (2023-08-17T11:57:49Z) - Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN [9.185929396989083]
We employ a sparse pixel approach to contrastively analyze the distinctions between Transformers and CNNs.
Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity.
We propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration.
arXiv Detail & Related papers (2023-08-16T12:46:52Z) - CompletionFormer: Depth Completion with Convolutions and Vision
Transformers [0.0]
This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure.
Our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods.
arXiv Detail & Related papers (2023-04-25T17:59:47Z) - Depthformer : Multiscale Vision Transformer For Monocular Depth
Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on an indoor NYUV2 dataset and an outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer for monocular depth estimation.
Our proposed method improves the state-of-the-art by 3.3%, and 3.3% respectively in terms of Root Mean Squared Error (RMSE)
arXiv Detail & Related papers (2022-07-10T20:49:11Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Multi-Frame Self-Supervised Depth with Transformers [33.00363651105475]
We propose a novel transformer architecture for cost volume generation.
We use depth-discretized epipolar sampling to select matching candidates.
We refine predictions through a series of self- and cross-attention layers.
arXiv Detail & Related papers (2022-04-15T19:04:57Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD)
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Learning Selective Mutual Attention and Contrast for RGB-D Saliency
Detection [145.4919781325014]
How to effectively fuse cross-modal information is the key problem for RGB-D salient object detection.
Many models use the feature fusion strategy but are limited by the low-order point-to-point fusion methods.
We propose a novel mutual attention model by fusing attention and contexts from different modalities.
arXiv Detail & Related papers (2020-10-12T08:50:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.