MonoViT: Self-Supervised Monocular Depth Estimation with a Vision
Transformer
- URL: http://arxiv.org/abs/2208.03543v1
- Date: Sat, 6 Aug 2022 16:54:45 GMT
- Title: MonoViT: Self-Supervised Monocular Depth Estimation with a Vision
Transformer
- Authors: Chaoqiang Zhao, Youmin Zhang, Matteo Poggi, Fabio Tosi, Xianda Guo,
Zheng Zhu, Guan Huang, Yang Tang, Stefano Mattoccia
- Abstract summary: We propose MonoViT, a framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation.
By combining plain convolutions with Transformer blocks, our model can reason locally and globally, yielding depth prediction at a higher level of detail and accuracy.
- Score: 52.0699787446221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation is an attractive solution that
does not require hard-to-source depth labels for training. Convolutional neural
networks (CNNs) have recently achieved great success in this task. However,
their limited receptive field constrains existing network architectures to
reason only locally, dampening the effectiveness of the self-supervised
paradigm. In the light of the recent successes achieved by Vision Transformers
(ViTs), we propose MonoViT, a brand-new framework combining the global
reasoning enabled by ViT models with the flexibility of self-supervised
monocular depth estimation. By combining plain convolutions with Transformer
blocks, our model can reason locally and globally, yielding depth prediction at
a higher level of detail and accuracy, allowing MonoViT to achieve
state-of-the-art performance on the established KITTI dataset. Moreover,
MonoViT proves its superior generalization capacities on other datasets such as
Make3D and DrivingStereo.
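The paper's core architectural idea, as described in the abstract, is to pair plain convolutions for local reasoning with Transformer blocks for global reasoning inside the depth network. The following is a minimal PyTorch sketch of one such hybrid local-global stage; it is not the authors' MonoViT code (which builds on an MPViT backbone), and the module names, channel sizes, and fusion scheme are illustrative assumptions.

```python
# Minimal sketch of a hybrid local-global encoder stage in the spirit of MonoViT's
# idea of pairing plain convolutions with Transformer blocks. Module names, channel
# sizes, and the fusion scheme are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    """Convolution branch for local detail + self-attention branch for global context."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local reasoning: a plain 3x3 convolution preserves fine-grained spatial detail.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Global reasoning: multi-head self-attention over all spatial positions.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fuse the two branches back to the original channel count.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        # Flatten the feature map to a token sequence (B, H*W, C) for attention.
        tokens = self.norm(x.flatten(2).transpose(1, 2))
        global_ctx, _ = self.attn(tokens, tokens, tokens)
        global_ctx = global_ctx.transpose(1, 2).reshape(b, c, h, w)
        # Concatenate local and global features and project back.
        return self.fuse(torch.cat([local, global_ctx], dim=1))


if __name__ == "__main__":
    block = LocalGlobalBlock(channels=64)
    feats = torch.randn(2, 64, 24, 80)  # e.g. a downsampled feature map from a KITTI frame
    print(block(feats).shape)  # torch.Size([2, 64, 24, 80])
```

In the full self-supervised pipeline, features from stages like this would feed a depth decoder, and training would minimize a photometric reprojection loss against views synthesized from neighboring frames, which is what allows learning without ground-truth depth labels.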
Related papers
- ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights [61.36309876889977]
ViT-Lens enables efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space.
In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art.
We will release the results of ViT-Lens on more modalities in the near future.
arXiv Detail & Related papers (2023-08-20T07:26:51Z)
- Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation [33.018300966769516]
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domain predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
arXiv Detail & Related papers (2022-11-20T20:00:21Z)
- A lightweight Transformer-based model for fish landmark detection [4.08805092034476]
We develop a novel model architecture that we call Mobile fish landmark detection network (MFLD-net).
MFLD-net can achieve competitive or better results in low data regimes while being lightweight.
Unlike ViT, MFLD-net does not need a pre-trained model and can generalise well when trained on a small dataset.
arXiv Detail & Related papers (2022-09-13T07:18:57Z)
- Deep Digging into the Generalization of Self-Supervised Monocular Depth Estimation [12.336888210144936]
Self-supervised monocular depth estimation has been widely studied recently.
We investigate how backbone networks (e.g. CNNs, Transformers, and CNN-Transformer hybrid models) affect the generalization of monocular depth estimation.
arXiv Detail & Related papers (2022-05-23T06:56:25Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation [79.265315267391]
We propose a simple and compact ViT architecture called Universal Vision Transformer (UViT).
UViT achieves strong performance on object detection and instance segmentation tasks.
arXiv Detail & Related papers (2021-12-17T20:11:56Z)
- SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D.
We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image.
Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
arXiv Detail & Related papers (2021-12-03T13:57:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.