MonoViT: Self-Supervised Monocular Depth Estimation with a Vision
Transformer
- URL: http://arxiv.org/abs/2208.03543v1
- Date: Sat, 6 Aug 2022 16:54:45 GMT
- Title: MonoViT: Self-Supervised Monocular Depth Estimation with a Vision
Transformer
- Authors: Chaoqiang Zhao, Youmin Zhang, Matteo Poggi, Fabio Tosi, Xianda Guo,
Zheng Zhu, Guan Huang, Yang Tang, Stefano Mattoccia
- Abstract summary: We propose MonoViT, a framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.
By combining plain convolutions with Transformer blocks, our model can reason locally and globally, yielding depth prediction at a higher level of detail and accuracy.
- Score: 52.0699787446221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation is an attractive solution that
does not require hard-to-source depth labels for training. Convolutional neural
networks (CNNs) have recently achieved great success in this task. However,
their limited receptive field constrains existing network architectures to
reason only locally, dampening the effectiveness of the self-supervised
paradigm. In light of the recent successes achieved by Vision Transformers
(ViTs), we propose MonoViT, a novel framework combining the global
reasoning enabled by ViT models with the flexibility of self-supervised
monocular depth estimation. By combining plain convolutions with Transformer
blocks, our model can reason locally and globally, yielding depth prediction at
a higher level of detail and accuracy, allowing MonoViT to achieve
state-of-the-art performance on the established KITTI dataset. Moreover,
MonoViT demonstrates its superior generalization capability on other datasets such as
Make3D and DrivingStereo.
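The abstract's core idea, pairing plain convolutions (local reasoning) with Transformer blocks (global reasoning) inside the depth encoder, can be illustrated with a minimal PyTorch sketch. This is not the authors' released code: the class name, layer sizes, and residual wiring below are assumptions made purely for illustration.

```python
# Illustrative sketch (assumed structure, not the official MonoViT implementation):
# a hybrid encoder block that applies a plain 3x3 convolution for local reasoning,
# then multi-head self-attention over the flattened spatial tokens for global reasoning.
import torch
import torch.nn as nn


class HybridConvTransformerBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local branch: plain convolution over the feature map.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Global branch: self-attention over all spatial positions.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from an earlier encoder stage.
        x = self.conv(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed)
        tokens = tokens + attn_out             # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = HybridConvTransformerBlock(channels=64)
    feats = torch.randn(2, 64, 24, 80)         # e.g. a downsampled KITTI-sized feature map
    print(block(feats).shape)                   # torch.Size([2, 64, 24, 80])
```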
Related papers
- Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo Matching [7.840781070208874]
We propose leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo.
We introduce knowledge transfer with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning.
Experimental results demonstrate that our pre-trained model exhibits strong zero-shot capabilities.
arXiv Detail & Related papers (2024-11-14T03:01:36Z)
- Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition [49.14350399025926]
We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition.
Middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images.
arXiv Detail & Related papers (2024-07-28T11:52:36Z)
- A lightweight Transformer-based model for fish landmark detection [4.08805092034476]
We develop a novel model architecture that we call the Mobile fish landmark detection network (MFLD-net).
MFLD-net can achieve competitive or better results in low data regimes while being lightweight.
Unlike ViT, MFLD-net does not need a pre-trained model and can generalise well when trained on a small dataset.
arXiv Detail & Related papers (2022-09-13T07:18:57Z)
- Deep Digging into the Generalization of Self-Supervised Monocular Depth Estimation [12.336888210144936]
Self-supervised monocular depth estimation has been widely studied recently.
We investigate backbone networks (e.g., CNNs, Transformers, and CNN-Transformer hybrid models) with respect to the generalization of monocular depth estimation.
arXiv Detail & Related papers (2022-05-23T06:56:25Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
- SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D.
We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image.
Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
arXiv Detail & Related papers (2021-12-03T13:57:14Z)