MonoViT: Self-Supervised Monocular Depth Estimation with a Vision
Transformer
- URL: http://arxiv.org/abs/2208.03543v1
- Date: Sat, 6 Aug 2022 16:54:45 GMT
- Title: MonoViT: Self-Supervised Monocular Depth Estimation with a Vision
Transformer
- Authors: Chaoqiang Zhao, Youmin Zhang, Matteo Poggi, Fabio Tosi, Xianda Guo,
Zheng Zhu, Guan Huang, Yang Tang, Stefano Mattoccia
- Abstract summary: We propose MonoViT, a framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.
By combining plain convolutions with Transformer blocks, our model can reason locally and globally, yielding depth prediction at a higher level of detail and accuracy.
- Score: 52.0699787446221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation is an attractive solution that
does not require hard-to-source depth labels for training. Convolutional neural
networks (CNNs) have recently achieved great success in this task. However,
their limited receptive field constrains existing network architectures to
reason only locally, dampening the effectiveness of the self-supervised
paradigm. In the light of the recent successes achieved by Vision Transformers
(ViTs), we propose MonoViT, a brand-new framework combining the global
reasoning enabled by ViT models with the flexibility of self-supervised
monocular depth estimation. By combining plain convolutions with Transformer
blocks, our model can reason locally and globally, yielding depth prediction at
a higher level of detail and accuracy, allowing MonoViT to achieve
state-of-the-art performance on the established KITTI dataset. Moreover,
MonoViT proves its superior generalization capacities on other datasets such as
Make3D and DrivingStereo.
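The paper's core architectural idea, as described in the abstract, is to pair plain convolutions for local reasoning with Transformer blocks for global reasoning inside the depth network. The following is a minimal PyTorch sketch of one such hybrid local-global stage; it is not the authors' MonoViT code (which builds on an MPViT backbone), and the module names, channel sizes, and fusion scheme are illustrative assumptions.

```python
# Minimal sketch of a hybrid local-global encoder stage in the spirit of MonoViT's
# idea of pairing plain convolutions with Transformer blocks. Module names, channel
# sizes, and the fusion scheme are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    """Convolution branch for local detail + self-attention branch for global context."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local reasoning: a plain 3x3 convolution preserves fine-grained spatial detail.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Global reasoning: multi-head self-attention over all spatial positions.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fuse the two branches back to the original channel count.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        # Flatten the feature map to a token sequence (B, H*W, C) for attention.
        tokens = self.norm(x.flatten(2).transpose(1, 2))
        global_ctx, _ = self.attn(tokens, tokens, tokens)
        global_ctx = global_ctx.transpose(1, 2).reshape(b, c, h, w)
        # Concatenate local and global features and project back.
        return self.fuse(torch.cat([local, global_ctx], dim=1))


if __name__ == "__main__":
    block = LocalGlobalBlock(channels=64)
    feats = torch.randn(2, 64, 24, 80)  # e.g. a downsampled feature map from a KITTI frame
    print(block(feats).shape)  # torch.Size([2, 64, 24, 80])
```

In the full self-supervised pipeline, features from stages like this would feed a depth decoder, and training would minimize a photometric reprojection loss against views synthesized from neighboring frames, which is what allows learning without ground-truth depth labels.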
Related papers
- ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights [61.36309876889977]
ViT-Lens enables efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space.
In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art.
We will release the results of ViT-Lens on more modalities in the near future.
arXiv Detail & Related papers (2023-08-20T07:26:51Z)
- Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation [33.018300966769516]
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domain predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
arXiv Detail & Related papers (2022-11-20T20:00:21Z)
- A lightweight Transformer-based model for fish landmark detection [4.08805092034476]
We develop a novel model architecture that we call Mobile fish landmark detection network (MFLD-net).
MFLD-net can achieve competitive or better results in low data regimes while being lightweight.
Unlike ViT, MFLD-net does not need a pre-trained model and can generalise well when trained on a small dataset.
arXiv Detail & Related papers (2022-09-13T07:18:57Z)
- Deep Digging into the Generalization of Self-Supervised Monocular Depth Estimation [12.336888210144936]
Self-supervised monocular depth estimation has been widely studied recently.
We investigate how backbone networks (e.g. CNNs, Transformers, and CNN-Transformer hybrid models) affect the generalization of monocular depth estimation.
arXiv Detail & Related papers (2022-05-23T06:56:25Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
- SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D.
We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image.
Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
arXiv Detail & Related papers (2021-12-03T13:57:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.