MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation
- URL: http://arxiv.org/abs/2307.14460v1
- Date: Wed, 26 Jul 2023 19:01:49 GMT
- Title: MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation
- Authors: Reiner Birkl, Diana Wofk, Matthias Müller
- Abstract summary: We release MiDaS v3.1 for monocular depth estimation, offering a variety of new models based on different encoder backbones.
We explore how using the most promising vision transformers as image encoders impacts depth estimation quality and runtime of the MiDaS architecture.
- Score: 4.563488428831042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We release MiDaS v3.1 for monocular depth estimation, offering a variety of
new models based on different encoder backbones. This release is motivated by
the success of transformers in computer vision, with a large variety of
pretrained vision transformers now available. We explore how using the most
promising vision transformers as image encoders impacts depth estimation
quality and runtime of the MiDaS architecture. Our investigation also includes
recent convolutional approaches that achieve comparable quality to vision
transformers in image classification tasks. While the previous release MiDaS
v3.0 solely leverages the vanilla vision transformer ViT, MiDaS v3.1 offers
additional models based on BEiT, Swin, SwinV2, Next-ViT and LeViT. These models
offer different performance-runtime tradeoffs. The best model improves the
depth estimation quality by 28% while efficient models enable downstream tasks
requiring high frame rates. We also describe the general process for
integrating new backbones. A video summarizing the work can be found at
https://youtu.be/UjaeNNFf9sE and the code is available at
https://github.com/isl-org/MiDaS.
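As a practical aid, the following is a minimal inference sketch based on the torch.hub interface documented in the MiDaS repository; the model name ("DPT_Large"), the input file name, and the transform choice are illustrative placeholders, and MiDaS v3.1 exposes additional backbone-specific model names not shown here.

```python
import cv2
import torch

# Load a MiDaS model via torch.hub (model name is an example; see the repo for the full v3.1 list).
model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas.to(device).eval()

# Load the matching input transforms shipped with the repository.
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.dpt_transform if "DPT" in model_type else transforms.small_transform

# Read an image (path is a placeholder) and predict relative inverse depth.
img = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
input_batch = transform(img).to(device)

with torch.no_grad():
    prediction = midas(input_batch)
    # Resize the prediction back to the original image resolution.
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

depth = prediction.cpu().numpy()  # larger values correspond to closer surfaces (relative depth)
```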
Related papers
- MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model [87.71060849866093]
We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks.
Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses.
We present several training and model modifications to strengthen the model with scaled-up datasets.
arXiv Detail & Related papers (2024-11-25T07:34:23Z) - Scaling Vision Transformers to 22 Billion Parameters [140.67853929168382]
Vision Transformers (ViT) bring the Transformer architecture to image and video modelling, but they have not yet been scaled to nearly the same degree as large language models.
We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model.
ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
arXiv Detail & Related papers (2023-02-10T18:58:21Z) - Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that, for deeper models, the memory savings more than offset the additional computational burden of recomputing activations.
arXiv Detail & Related papers (2023-02-09T18:59:54Z) - AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition [39.443380221227166]
We propose an effective adaptation approach for Transformer, namely AdaptFormer.
It can efficiently adapt pre-trained ViTs to many different image and video tasks.
It increases the ViT's transferability without updating the original pre-trained parameters.
arXiv Detail & Related papers (2022-05-26T17:56:15Z) - ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation [76.35955924137986]
We show that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets.
Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set.
arXiv Detail & Related papers (2022-04-26T17:55:04Z) - Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach introduces no new components; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than that of recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, while relying on much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.