HiMODE: A Hybrid Monocular Omnidirectional Depth Estimation Model
- URL: http://arxiv.org/abs/2204.05007v1
- Date: Mon, 11 Apr 2022 11:11:43 GMT
- Title: HiMODE: A Hybrid Monocular Omnidirectional Depth Estimation Model
- Authors: Masum Shah Junayed, Arezoo Sadeghzadeh, Md Baharul Islam, Lai-Kuan
Wong, Tarkan Aydin
- Abstract summary: HiMODE is a novel monocular omnidirectional depth estimation model based on a CNN+Transformer architecture.
We show that HiMODE can achieve state-of-the-art performance for 360° monocular depth estimation.
- Score: 3.5290359800552946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular omnidirectional depth estimation is receiving considerable research
attention due to its broad applications for sensing 360° surroundings.
Existing approaches in this field suffer from limitations in recovering small
object details and data lost during the ground-truth depth map acquisition. In
this paper, a novel monocular omnidirectional depth estimation model, namely
HiMODE, is proposed based on a hybrid CNN+Transformer (encoder-decoder)
architecture whose modules are efficiently designed to mitigate distortion and
computational cost, without performance degradation. Firstly, we design a
feature pyramid network based on the HNet block to extract high-resolution
features near the edges. The performance is further improved by a self- and
cross-attention layer and spatial/temporal patches in the Transformer
encoder and decoder, respectively. In addition, a spatial residual block is
employed to reduce the number of parameters. By jointly passing the deep
features extracted from an input image at each backbone block, along with the
raw depth maps predicted by the transformer encoder-decoder, through a context
adjustment layer, our model can produce depth maps with better visual
quality than the ground-truth. Comprehensive ablation studies demonstrate the
significance of each individual module. Extensive experiments conducted on
three datasets: Stanford3D, Matterport3D, and SunCG, demonstrate that HiMODE
can achieve state-of-the-art performance for 360° monocular depth
estimation.
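
The pipeline described in the abstract can be summarized in a short PyTorch sketch. Everything below is an illustrative approximation under assumed names and sizes: `CNNBackbone` stands in for the HNet-based feature pyramid, a generic `nn.TransformerEncoder` stands in for the self/cross-attention encoder-decoder, and `context_adjust` mimics the context adjustment layer fusing deep features with the raw depth prediction; none of it is the authors' implementation.

```python
# Minimal sketch of a hybrid CNN+Transformer depth pipeline.
# All module names, widths, and depths are illustrative assumptions.
import torch
import torch.nn as nn

class CNNBackbone(nn.Module):
    """Stand-in for the HNet-based feature pyramid (CNN encoder)."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.stem(x)  # (B, dim, H/4, W/4)

class HiMODESketch(nn.Module):
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        self.backbone = CNNBackbone(dim=dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                         batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=layers)
        self.depth_head = nn.Linear(dim, 1)
        # Context adjustment: fuses CNN features with the raw depth map.
        self.context_adjust = nn.Conv2d(dim + 1, 1, 3, padding=1)

    def forward(self, img):
        feats = self.backbone(img)                 # deep CNN features
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H*W, C) patch tokens
        tokens = self.transformer(tokens)          # global attention
        raw = self.depth_head(tokens)              # per-token raw depth
        raw = raw.transpose(1, 2).reshape(b, 1, h, w)
        # Jointly pass deep features and raw depth through context adjustment.
        return self.context_adjust(torch.cat([feats, raw], dim=1))

depth = HiMODESketch()(torch.randn(1, 3, 256, 512))  # equirectangular input
print(depth.shape)  # torch.Size([1, 1, 64, 128])
```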
Related papers
- Depth Estimation From Monocular Images With Enhanced Encoder-Decoder Architecture [0.0]
This paper introduces a novel deep learning-based approach using an encoder-decoder architecture.
The Inception-ResNet-v2 model is utilized as the encoder.
Experimental results on the NYU Depth V2 dataset show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-15T13:46:19Z)
- Self-supervised Monocular Depth Estimation with Large Kernel Attention [30.44895226042849]
We propose a self-supervised monocular depth estimation network to recover finer details.
Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies.
Our method achieves competitive results on the KITTI dataset.
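
Large kernel attention is commonly realized (e.g., in the Visual Attention Network line of work) by decomposing one large convolution into a depthwise convolution, a depthwise dilated convolution, and a pointwise convolution, whose output reweights the input features. A minimal sketch, with `LargeKernelAttention` and all sizes as assumptions rather than this paper's exact decoder:

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Decomposed large-kernel attention: a 5x5 depthwise conv, a 7x7
    depthwise dilated conv (dilation 3), and a 1x1 pointwise conv together
    approximate a 21x21 receptive field at low cost."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9,
                                    groups=dim, dilation=3)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # reweight features with long-range context

x = torch.randn(1, 64, 48, 160)
print(LargeKernelAttention(64)(x).shape)  # torch.Size([1, 64, 48, 160])
```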
arXiv Detail & Related papers (2024-09-26T14:44:41Z)
- GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.
Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the most lightweight image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
- SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network [29.798579906253696]
It is challenging to acquire dense ground truth depth labels for supervised training, and the unsupervised depth estimation using monocular sequences emerges as a promising alternative.
In this paper, we employ a convolution-free Swin Transformer as an image feature extractor so that the network can capture both local geometric features and global semantic features for depth estimation.
Also, we propose a Densely Cascaded Multi-scale Network (DCMNet) that directly connects every feature map with those from other scales via a top-down cascade pathway.
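
A minimal sketch of such a top-down cascade over multi-scale features: DCMNet's dense connections between every scale pair are simplified here to adjacent-scale fusion, and `TopDownCascade`, the channel widths, and the fusion rule are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownCascade(nn.Module):
    """Simplified top-down pathway: each coarse feature map is upsampled and
    fused with the next finer one. DCMNet additionally connects every scale
    pair densely; this sketch keeps only adjacent connections."""
    def __init__(self, channels=(256, 128, 64)):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, 64, 1) for c in channels])

    def forward(self, feats):  # feats ordered coarse -> fine
        out = self.lateral[0](feats[0])
        for lat, f in zip(self.lateral[1:], feats[1:]):
            out = F.interpolate(out, size=f.shape[-2:], mode="bilinear",
                                align_corners=False) + lat(f)
        return out  # fused feature map at the finest resolution

feats = [torch.randn(1, c, 8 * 2**i, 8 * 2**i)
         for i, c in enumerate((256, 128, 64))]
print(TopDownCascade()(feats).shape)  # torch.Size([1, 64, 32, 32])
```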
arXiv Detail & Related papers (2023-01-17T06:01:46Z)
- Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on the indoor NYUV2 dataset and the outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer for monocular depth estimation.
Our proposed method improves the state-of-the-art by 3.3% and 3.3%, respectively, in terms of Root Mean Squared Error (RMSE).
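
For reference, RMSE over a depth map is computed as follows; this is the standard metric definition, with the validity mask as a common convention rather than anything specific to Depthformer.

```python
import numpy as np

def rmse(pred, gt):
    """Root Mean Squared Error between predicted and ground-truth depth,
    computed over valid (positive) ground-truth pixels."""
    mask = gt > 0
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))

print(rmse(np.array([2.1, 3.9]), np.array([2.0, 4.0])))  # ~0.1
```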
arXiv Detail & Related papers (2022-07-10T20:49:11Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by significant margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection and contour estimation. The multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z)
- Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth [64.29043589521308]
We propose a rendering module to augment the training data by synthesizing images with virtual-depths.
The rendering module takes as input the RGB image and its corresponding sparse depth image, and outputs a variety of photo-realistic synthetic images.
Besides, we introduce an auxiliary module to improve the detection model by jointly optimizing it through a depth estimation task.
arXiv Detail & Related papers (2021-07-28T11:00:47Z)
- PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss [49.66736599668501]
We propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net.
Our method shows unprecedented accuracy levels, exceeding 95% in terms of the $\delta_1$ metric on the KITTI dataset.
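
The $\delta_1$ metric counts the fraction of pixels whose predicted-to-ground-truth depth ratio stays below 1.25. A minimal sketch of this standard definition (the validity mask is a common convention, not a detail from PLADE-Net):

```python
import numpy as np

def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < 1.25 --
    the standard delta_1 accuracy metric for depth estimation."""
    mask = gt > 0
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < threshold))

print(delta1(np.array([1.0, 2.0, 10.0]),
             np.array([1.1, 2.0, 5.0])))  # ~0.667
```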
arXiv Detail & Related papers (2021-03-12T15:54:46Z)
- A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection [89.88222217065858]
We design a single stream network to use the depth map to guide early fusion and middle fusion between RGB and depth.
This model is 55.5% lighter than the current lightest model and runs at a real-time speed of 32 FPS when processing a $384 \times 384$ image.
arXiv Detail & Related papers (2020-07-14T04:40:14Z)