Transformers in Unsupervised Structure-from-Motion
- URL: http://arxiv.org/abs/2312.10529v1
- Date: Sat, 16 Dec 2023 20:00:34 GMT
- Title: Transformers in Unsupervised Structure-from-Motion
- Authors: Hemang Chawla, Arnav Varma, Elahe Arani, and Bahram Zonooz
- Abstract summary: Transformers have revolutionized deep learning based computer vision with improved performance as well as robustness to natural corruptions and adversarial attacks.
We propose a robust transformer-based monocular SfM method that learns to predict monocular pixel-wise depth, the ego vehicle's translation and rotation, and the camera's focal length and principal point simultaneously.
Our study shows that transformer-based architectures achieve comparable performance while being more robust against natural corruptions as well as untargeted and targeted attacks.
- Score: 19.43053045216986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have revolutionized deep learning based computer vision with
improved performance as well as robustness to natural corruptions and
adversarial attacks. Transformers are used predominantly for 2D vision tasks,
including image classification, semantic segmentation, and object detection.
However, robots and advanced driver assistance systems also require 3D scene
understanding for decision making by extracting structure-from-motion (SfM). We
propose a robust transformer-based monocular SfM method that learns to predict
monocular pixel-wise depth, the ego vehicle's translation and rotation, and the
camera's focal length and principal point simultaneously. With experiments on
KITTI and DDAD datasets, we demonstrate how to adapt different vision
transformers and compare them against contemporary CNN-based methods. Our study
shows that transformer-based architectures, though lower in run-time efficiency,
achieve comparable performance while being more robust against natural
corruptions as well as untargeted and targeted attacks.
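
The abstract describes the standard unsupervised SfM recipe: a depth network and a pose/intrinsics network are trained jointly by warping a source frame into the target view using the predicted depth, relative camera motion, and intrinsics, and minimizing the photometric reconstruction error. The sketch below is a minimal PyTorch outline of that objective under stated assumptions; the names `DepthNet` and `PoseIntrinsicsNet`, the plain L1 loss, and all shapes are hypothetical placeholders, not the authors' implementation (published methods typically also add an SSIM term and a smoothness prior).

```python
# Minimal sketch (not the authors' code) of the photometric objective used in
# unsupervised SfM: predicted depth, ego-motion, and intrinsics warp a source
# frame into the target view, and the reconstruction error supervises all three.
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using predicted depth and intrinsics."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    rays = K_inv @ pix                         # (B, 3, H*W) camera rays
    return rays * depth.reshape(b, 1, -1)      # scale rays by predicted depth

def photometric_loss(target, source, depth, T, K):
    """Warp `source` into the target view and compare with `target`.
    A plain L1 error is used here for brevity."""
    b, _, h, w = target.shape
    points = backproject(depth, torch.inverse(K))          # (B, 3, H*W)
    points = T[:, :3, :3] @ points + T[:, :3, 3:]          # rigid ego-motion
    proj = K @ points
    pix = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)        # perspective divide
    grid = torch.stack(                                     # normalize to [-1, 1]
        [2 * pix[:, 0] / (w - 1) - 1, 2 * pix[:, 1] / (h - 1) - 1], dim=-1
    ).reshape(b, h, w, 2)
    warped = F.grid_sample(source, grid, padding_mode="border", align_corners=True)
    return (warped - target).abs().mean()

# Hypothetical networks (transformer or CNN backbones would slot in here):
# depth = DepthNet(target_frame)                           # (B, 1, H, W)
# T, K  = PoseIntrinsicsNet(target_frame, source_frame)    # (B, 4, 4), (B, 3, 3)
# loss  = photometric_loss(target_frame, source_frame, depth, T, K)
```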
Related papers
- Learning Explicit Object-Centric Representations with Vision Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z)
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has sparked interest in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions [1.1032962642000486]
This work builds on the Vision Transformer combined with a pyramid architecture, using a split-transform-merge strategy to propose a group encoder, and names the resulting network the Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
arXiv Detail & Related papers (2022-03-02T09:14:28Z)
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics [13.7258515433446]
Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how transformer-based architectures achieve comparable performance while being more robust and generalizable.
arXiv Detail & Related papers (2022-02-07T13:17:29Z)
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z)
- A Survey of Visual Transformers [30.082304742571598]
The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Pioneering works have recently adapted Transformer architectures to computer vision (CV).
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
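
To make the ViT-FRCNN idea in the entry above concrete: a plain vision transformer emits a sequence of patch tokens, and those tokens can be reshaped back into a 2D feature map that a conventional detection head (e.g., a Faster R-CNN-style RPN and box head) consumes. The following is a minimal PyTorch sketch under that assumption; `ViTBackbone` and all hyperparameters are illustrative placeholders, not the paper's configuration.

```python
# Hedged sketch of the "ViT backbone feeding a detection head" idea: patch
# tokens from a transformer encoder are reshaped into a 2D feature map that a
# conventional detector head can consume. Hyperparameters are illustrative only.
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        self.grid = img_size // patch                 # tokens per image side
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, C)
        tokens = self.encoder(tokens + self.pos)
        # Fold the token sequence back into a spatial map so a standard
        # region-proposal / box head can treat it like a CNN feature map.
        b, n, c = tokens.shape
        return tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)

feat = ViTBackbone()(torch.randn(2, 3, 224, 224))     # -> (2, 384, 14, 14)
```

The resulting feature map plays the same role as a CNN backbone's final feature map, which is what allows existing region-proposal and box heads to be reused largely unchanged.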