Attention Attention Everywhere: Monocular Depth Prediction with Skip
Attention
- URL: http://arxiv.org/abs/2210.09071v1
- Date: Mon, 17 Oct 2022 13:14:47 GMT
- Title: Attention Attention Everywhere: Monocular Depth Prediction with Skip
Attention
- Authors: Ashutosh Agarwal and Chetan Arora
- Abstract summary: Monocular Depth Estimation (MDE) aims to predict pixel-wise depth given a single RGB image.
Inspired by the demonstrated benefits of attention in a multitude of computer vision problems, we propose an attention-based fusion of encoder and decoder features.
- Score: 6.491470878214977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monocular Depth Estimation (MDE) aims to predict pixel-wise depth given a
single RGB image. For both convolutional and the recent
attention-based models, encoder-decoder architectures have been found to
be useful due to the simultaneous requirement of global context and pixel-level
resolution. Typically, a skip connection module is used to fuse the encoder and
decoder features, which consists of feature map concatenation followed by a
convolution operation. Inspired by the demonstrated benefits of attention in a
multitude of computer vision problems, we propose an attention-based fusion of
encoder and decoder features. We pose MDE as a pixel query refinement problem,
where coarsest-level encoder features are used to initialize pixel-level
queries, which are then refined to higher resolutions by the proposed Skip
Attention Module (SAM). We formulate the prediction problem as ordinal
regression over the bin centers that discretize the continuous depth range and
introduce a Bin Center Predictor (BCP) module that predicts bins at the
coarsest level using pixel queries. Apart from the benefit of image adaptive
depth binning, the proposed design helps learn improved depth embedding in
initial pixel queries via direct supervision from the ground truth. Extensive
experiments on the two canonical datasets, NYUV2 and KITTI, show that our
architecture outperforms the state-of-the-art by 5.3% and 3.9%, respectively,
along with an improved generalization performance by 9.4% on the SUNRGBD
dataset. Code is available at https://github.com/ashutosh1807/PixelFormer.git.
Related papers
- Mask-adaptive Gated Convolution and Bi-directional Progressive Fusion
Network for Depth Completion [3.8558637038709622]
We propose a new model for depth completion based on an encoder-decoder structure.
Our model introduces two key components: the Mask-adaptive Gated Convolution architecture and the Bi-directional Progressive Fusion module (see the sketch after this entry).
We achieve remarkable performance in depth map completion, outperforming existing approaches in terms of accuracy and reliability.
arXiv Detail & Related papers (2024-01-15T02:58:06Z)
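The Mask-adaptive Gated Convolution itself is defined in the paper above; as a rough sketch of the underlying gating mechanism, the following generic gated convolution uses a learned sigmoid gate to modulate feature responses, which lets a depth-completion network down-weight positions with missing or unreliable depth. The class name, channel counts, and input layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Generic gated convolution (a sketch of the mechanism, not the
    paper's Mask-adaptive Gated Convolution): a sigmoid gate learned
    from the input modulates the convolved features."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.gate = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.feature(x) * torch.sigmoid(self.gate(x))

# Hypothetical usage: sparse depth plus a validity mask as a 2-channel input
layer = GatedConv2d(2, 32)
x = torch.randn(1, 2, 240, 320)
print(layer(x).shape)  # torch.Size([1, 32, 240, 320])
```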
- PointHR: Exploring High-Resolution Architectures for 3D Point Cloud
Segmentation [77.44144260601182]
We explore high-resolution architectures for 3D point cloud segmentation.
We propose a unified pipeline named PointHR, which includes a kNN-based sequence operator for feature extraction and a differential resampling operator (see the sketch after this entry).
To evaluate these architectures for dense point cloud analysis, we conduct thorough experiments using S3DIS and ScanNetV2 datasets.
arXiv Detail & Related papers (2023-10-11T09:29:17Z)
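As a rough sketch of what a kNN-based sequence operator can look like (PointHR's exact operator is defined in the paper; the helper below and its shapes are illustrative assumptions), each point gathers the features of its k nearest neighbors into a short sequence for downstream processing:

```python
import torch

def knn_group(points: torch.Tensor, feats: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Gather per-point neighbor features (illustrative sketch).

    points: (B, N, 3) xyz coordinates
    feats:  (B, N, C) per-point features
    returns (B, N, k, C) neighbor feature sequences
    """
    dist = torch.cdist(points, points)                 # (B, N, N) pairwise distances
    idx = dist.topk(k, dim=-1, largest=False).indices  # (B, N, k), includes the point itself
    b, n, _ = feats.shape
    batch = torch.arange(b, device=feats.device).view(b, 1, 1).expand(b, n, k)
    return feats[batch, idx]                           # advanced indexing -> (B, N, k, C)
```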
- Low-Resolution Self-Attention for Semantic Segmentation [96.81482872022237]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost (see the sketch after this entry).
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
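A minimal sketch of the low-resolution self-attention idea follows; it is illustrative rather than the LRFormer code, and the pooled size, head count, and residual upsampling fusion are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowResSelfAttention(nn.Module):
    """Self-attention computed in a fixed low-resolution space, so its
    cost stays constant regardless of the input resolution (sketch)."""

    def __init__(self, dim: int, pool_size: int = 16, num_heads: int = 4):
        super().__init__()
        self.pool_size = pool_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.pool_size
        low = F.adaptive_avg_pool2d(x, (s, s))    # (B, C, s, s): fixed-size map
        seq = low.flatten(2).transpose(1, 2)      # (B, s*s, C)
        out, _ = self.attn(seq, seq, seq)         # global context at low cost
        out = out.transpose(1, 2).reshape(b, c, s, s)
        up = F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)
        return x + up                             # fuse global context back in
```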
- Depth Monocular Estimation with Attention-based Encoder-Decoder Network
from Single Image [7.753378095194288]
Vision-based approaches have recently received much attention and can overcome the drawbacks of active depth sensors.
In this work, we explore an extreme scenario in vision-based settings: estimating a depth map from a single monocular image, where predictions are severely plagued by grid artifacts and blurry edges.
Our novel approach can find the focus of the current image with minimal overhead and avoid the loss of depth features.
arXiv Detail & Related papers (2022-10-24T23:01:25Z)
- Depthformer : Multiscale Vision Transformer For Monocular Depth
Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on the indoor NYUV2 dataset and the outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer for monocular depth estimation.
Our proposed method improves the state-of-the-art by 3.3% and 3.3%, respectively, in terms of Root Mean Squared Error (RMSE).
arXiv Detail & Related papers (2022-07-10T20:49:11Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt the Swin-Transformer as the feature extractor for both the RGB and depth modalities to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Small Lesion Segmentation in Brain MRIs with Subpixel Embedding [105.1223735549524]
We present a method to segment MRI scans of the human brain into ischemic stroke lesion and normal tissues.
We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network.
arXiv Detail & Related papers (2021-09-18T00:21:17Z)
- Pixel-Perfect Structure-from-Motion with Featuremetric Refinement [96.73365545609191]
We refine two key steps of structure-from-motion by a direct alignment of low-level image information from multiple views.
This significantly improves the accuracy of camera poses and scene geometry for a wide range of keypoint detectors.
Our system easily scales to large image collections, enabling pixel-perfect crowd-sourced localization at scale.
arXiv Detail & Related papers (2021-08-18T17:58:55Z)
- High-resolution Depth Maps Imaging via Attention-based Hierarchical
Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided DSR.
We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z)
- AdaBins: Depth Estimation using Adaptive Bins [43.07310038858445]
We propose a transformer-based architecture block that divides the depth range into bins whose center value is estimated adaptively per image.
Our results show a decisive improvement over the state-of-the-art on several popular depth datasets (see the sketch after this entry).
arXiv Detail & Related papers (2020-11-28T14:40:45Z)
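AdaBins' bin-based readout is also the mechanism the main paper's Bin Center Predictor builds on: depth is read out as an ordinal, probability-weighted sum over adaptive bin centers. A minimal sketch of that readout, with the function name, depth range, and tensor shapes as illustrative assumptions rather than the authors' code:

```python
import torch

def depth_from_adaptive_bins(bin_logits: torch.Tensor,
                             pixel_logits: torch.Tensor,
                             d_min: float = 1e-3,
                             d_max: float = 10.0) -> torch.Tensor:
    """Per-pixel depth as a probability-weighted sum of per-image,
    adaptively sized bin centers (illustrative sketch).

    bin_logits:   (B, N)       image-level logits -> normalized bin widths
    pixel_logits: (B, N, H, W) per-pixel logits over the N bins
    """
    widths = torch.softmax(bin_logits, dim=1) * (d_max - d_min)  # (B, N)
    edges = d_min + torch.cumsum(widths, dim=1)                  # right edges of the bins
    centers = edges - 0.5 * widths                               # (B, N) bin centers
    probs = torch.softmax(pixel_logits, dim=1)                   # (B, N, H, W)
    return torch.einsum("bn,bnhw->bhw", centers, probs)          # (B, H, W) depth map
```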