Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular
Depth Estimation
- URL: http://arxiv.org/abs/2211.11066v1
- Date: Sun, 20 Nov 2022 20:00:21 GMT
- Title: Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular
Depth Estimation
- Authors: Snehal Singh Tomar, Maitreya Suin, A.N. Rajagopalan
- Abstract summary: Most State of the Art (SOTA) works in the self-supervised and unsupervised domain employ a ResNet-based encoder architecture to predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
- Score: 33.018300966769516
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With an unprecedented increase in the number of agents and systems that aim
to navigate the real world using visual cues and the rising impetus for 3D
Vision Models, the importance of depth estimation is hard to overstate. While
supervised methods remain the gold standard in the domain, the copious amount
of paired stereo data required to train such models makes them impractical.
Most State of the Art (SOTA) works in the self-supervised and unsupervised
domain employ a ResNet-based encoder architecture to predict disparity maps
from a given input image which are eventually used alongside a camera pose
estimator to predict depth without direct supervision. The fully convolutional
nature of ResNets makes them susceptible to capturing per-pixel local
information only, which is suboptimal for depth prediction. Our key insight for
doing away with this bottleneck is to use Vision Transformers, which employ
self-attention to capture the global contextual information present in an input
image. Our model fuses per-pixel local information learned using two fully
convolutional depth encoders with global contextual information learned by a
transformer encoder at different scales. It does so using a mask-guided
multi-stream convolution in the feature space to achieve state-of-the-art
performance on most standard benchmarks.
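As a rough illustration of the fusion described above, the following PyTorch sketch blends CNN (per-pixel local) and transformer (global contextual) features at a single scale using a mask-guided weighting of two convolution streams. All class names, channel sizes, and the exact gating scheme are illustrative assumptions, not the authors' implementation, which fuses features at multiple scales and with more streams.

# Minimal sketch (assumed names and shapes) of mask-guided fusion of local CNN
# features and global transformer features at one decoder scale.
import torch
import torch.nn as nn

class MaskGuidedFusion(nn.Module):
    def __init__(self, cnn_channels: int, vit_channels: int, out_channels: int):
        super().__init__()
        in_channels = cnn_channels + vit_channels
        # Predict a soft per-pixel mask that decides how much each stream contributes.
        self.mask_head = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # One convolution stream per feature source.
        self.cnn_stream = nn.Conv2d(cnn_channels, out_channels, kernel_size=3, padding=1)
        self.vit_stream = nn.Conv2d(vit_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, cnn_feat: torch.Tensor, vit_feat: torch.Tensor) -> torch.Tensor:
        # vit_feat is assumed to already be reshaped/upsampled to the CNN feature resolution.
        mask = self.mask_head(torch.cat([cnn_feat, vit_feat], dim=1))
        local_out = self.cnn_stream(cnn_feat)
        global_out = self.vit_stream(vit_feat)
        # Mask-weighted blend of per-pixel local and global contextual features.
        return mask * local_out + (1.0 - mask) * global_out

# Example usage with dummy feature maps at one scale:
if __name__ == "__main__":
    fuse = MaskGuidedFusion(cnn_channels=64, vit_channels=96, out_channels=64)
    cnn_feat = torch.randn(1, 64, 48, 160)
    vit_feat = torch.randn(1, 96, 48, 160)
    print(fuse(cnn_feat, vit_feat).shape)  # torch.Size([1, 64, 48, 160])

The single two-stream blend here is only meant to convey the mask-guided weighting idea; the paper's multi-stream convolution in feature space is more elaborate.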
Related papers
- Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation.
Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model.
Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
- SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network [29.798579906253696]
It is challenging to acquire dense ground truth depth labels for supervised training, and the unsupervised depth estimation using monocular sequences emerges as a promising alternative.
In this paper, we employ a convolution-free Swin Transformer as an image feature extractor so that the network can capture both local geometric features and global semantic features for depth estimation.
Also, we propose a Densely Cascaded Multi-scale Network (DCMNet) that connects every feature map directly with another from different scales via a top-down cascade pathway.
arXiv Detail & Related papers (2023-01-17T06:01:46Z)
- Deep Convolutional Pooling Transformer for Deepfake Detection [54.10864860009834]
We propose a deep convolutional Transformer to incorporate decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy.
The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
arXiv Detail & Related papers (2022-09-12T15:05:41Z)
- Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on the indoor NYUV2 dataset and the outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer, for monocular depth estimation.
Our proposed method improves the state of the art by 3.3% and 3.3%, respectively, in terms of Root Mean Squared Error (RMSE).
arXiv Detail & Related papers (2022-07-10T20:49:11Z)
- Forecasting of depth and ego-motion with transformers and self-supervision [0.0]
This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego motion.
Given a sequence of raw images, the aim is to forecast both the geometry and ego-motion using a self-supervised photometric loss.
The architecture is designed using both convolution and transformer modules.
arXiv Detail & Related papers (2022-06-15T10:14:11Z)
- SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z)
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [13.7258515433446]
Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how transformer-based architecture achieves comparable performance while being more robust and generalizable.
arXiv Detail & Related papers (2022-02-07T13:17:29Z)
- Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss [49.66736599668501]
We propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net.
Our method shows unprecedented accuracy levels, exceeding 95% in terms of the $\delta_1$ metric on the KITTI dataset.
arXiv Detail & Related papers (2021-03-12T15:54:46Z)
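For context on the $\delta_1$ figure quoted in the PLADE-Net entry, $\delta_1$ is the standard depth-accuracy threshold: the fraction of valid pixels whose predicted-to-ground-truth depth ratio (or its inverse) is below 1.25. A minimal NumPy sketch, with variable names and the validity-mask handling chosen for illustration:

# Minimal sketch of the standard delta_1 depth-accuracy metric:
# the fraction of pixels where max(pred/gt, gt/pred) < 1.25.
# Variable names and the validity-mask handling are illustrative assumptions.
import numpy as np

def delta1(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    valid = gt > eps                       # ignore pixels without ground truth
    p = np.maximum(pred[valid], eps)       # guard against division by zero
    g = gt[valid]
    ratio = np.maximum(p / g, g / p)
    return float((ratio < 1.25).mean())    # e.g. 0.95 means 95% of pixels pass

# Example on dummy depth maps:
gt = np.random.uniform(1.0, 80.0, size=(192, 640))
pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)
print(delta1(pred, gt))  # close to 1.0 for this near-perfect prediction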