Transformers in Self-Supervised Monocular Depth Estimation with Unknown
Camera Intrinsics
- URL: http://arxiv.org/abs/2202.03131v1
- Date: Mon, 7 Feb 2022 13:17:29 GMT
- Title: Transformers in Self-Supervised Monocular Depth Estimation with Unknown
Camera Intrinsics
- Authors: Arnav Varma, Hemang Chawla, Bahram Zonooz and Elahe Arani
- Abstract summary: Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how a transformer-based architecture achieves comparable performance while being more robust and generalizable.
- Score: 13.7258515433446
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The advent of autonomous driving and advanced driver assistance systems
necessitates continuous developments in computer vision for 3D scene
understanding. Self-supervised monocular depth estimation, a method for
pixel-wise distance estimation of objects from a single camera without the use
of ground truth labels, is an important task in 3D scene understanding.
However, existing methods for this task are limited to convolutional neural
network (CNN) architectures. In contrast with CNNs that use localized linear
operations and lose feature resolution across the layers, vision transformers
process at constant resolution with a global receptive field at every stage.
While recent works have compared transformers against their CNN counterparts
for tasks such as image classification, no study exists that investigates the
impact of using transformers for self-supervised monocular depth estimation.
Here, we first demonstrate how to adapt vision transformers for self-supervised
monocular depth estimation. Thereafter, we compare the transformer and
CNN-based architectures for their performance on KITTI depth prediction
benchmarks, as well as their robustness to natural corruptions and adversarial
attacks, including when the camera intrinsics are unknown. Our study
demonstrates how a transformer-based architecture, though lower in run-time
efficiency, achieves comparable performance while being more robust and
generalizable.
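
For context, the training signal behind this family of methods is a photometric reprojection objective: predicted depth and relative camera pose are used to warp a neighboring video frame into the current view, and the reconstruction error supervises the networks without any ground-truth depth. The sketch below (PyTorch) is a minimal illustration of that objective, including assembling the intrinsics matrix from a predicted focal length and principal point for the unknown-intrinsics setting; the function names, the L1-only error (no SSIM, smoothness, or multi-scale terms), and all shapes are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def intrinsics_from_prediction(focal, center):
    """Assemble a pinhole intrinsics matrix K from predicted focal lengths (fx, fy)
    and principal point (cx, cy); this mirrors the unknown-intrinsics setting."""
    b = focal.shape[0]
    K = torch.zeros(b, 3, 3, dtype=focal.dtype, device=focal.device)
    K[:, 0, 0], K[:, 1, 1] = focal[:, 0], focal[:, 1]
    K[:, 0, 2], K[:, 1, 2] = center[:, 0], center[:, 1]
    K[:, 2, 2] = 1.0
    return K

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using predicted depth and inverse intrinsics."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    return (K_inv @ pix) * depth.reshape(b, 1, -1)            # (B, 3, H*W)

def photometric_loss(target, source, depth, pose, K):
    """Warp the source frame into the target view via depth and pose, then compare."""
    b, _, h, w = target.shape
    points = backproject(depth, torch.linalg.inv(K))          # pixels -> target-camera 3D points
    ones = torch.ones(b, 1, points.shape[-1], dtype=points.dtype, device=points.device)
    cam = (pose @ torch.cat([points, ones], dim=1))[:, :3]    # move points into the source camera
    uv = K @ cam                                              # project with the same intrinsics
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)               # perspective division
    u = uv[:, 0].reshape(b, h, w) / (w - 1) * 2 - 1           # normalize to [-1, 1] for grid_sample
    v = uv[:, 1].reshape(b, h, w) / (h - 1) * 2 - 1
    warped = F.grid_sample(source, torch.stack([u, v], dim=-1),
                           padding_mode="border", align_corners=True)
    return (target - warped).abs().mean()                     # L1 photometric error

In a full pipeline, depth would come from the depth network, pose from the pose network, and, when intrinsics are unknown, K from intrinsics_from_prediction applied to additional network outputs; minimizing this loss over consecutive frames trains everything jointly.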
Related papers
- Explainable Multi-Camera 3D Object Detection with Transformer-Based
Saliency Maps [0.0]
Vision Transformers (ViTs) have achieved state-of-the-art results on various computer vision tasks, including 3D object detection.
End-to-end implementation makes ViTs less explainable, which can be a challenge for deploying them in safety-critical applications.
We propose a novel method for generating saliency maps for a DETR-like ViT with multiple camera inputs used for 3D object detection.
arXiv Detail & Related papers (2023-12-22T11:03:12Z) - Transformers in Unsupervised Structure-from-Motion [19.43053045216986]
Transformers have revolutionized deep learning based computer vision with improved performance as well as robustness to natural corruptions and adversarial attacks.
We propose a robust transformer-based monocular SfM method that learns to predict monocular pixel-wise depth, the ego vehicle's translation and rotation, as well as the camera's focal length and principal point, simultaneously (an illustrative sketch of such a joint prediction head appears after this list).
Our study shows that transformer-based architecture achieves comparable performance while being more robust against natural corruptions, as well as untargeted and targeted attacks.
arXiv Detail & Related papers (2023-12-16T20:00:34Z) - OCTraN: 3D Occupancy Convolutional Transformer Network in Unstructured
Traffic Scenarios [0.0]
We propose OCTraN, a transformer architecture that uses iterative-attention to convert 2D image features into 3D occupancy features.
We also develop a self-supervised training pipeline to generalize the model to any scene by eliminating the need for LiDAR ground truth.
arXiv Detail & Related papers (2023-07-20T15:06:44Z) - Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular
Depth Estimation [33.018300966769516]
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domain predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
arXiv Detail & Related papers (2022-11-20T20:00:21Z) - Learning Explicit Object-Centric Representations with Vision
Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z) - 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z) - Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z) - Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - PCLs: Geometry-aware Neural Reconstruction of 3D Pose with Perspective
Crop Layers [111.55817466296402]
We introduce Perspective Crop Layers (PCLs) - a form of perspective crop of the region of interest based on the camera geometry.
PCLs deterministically remove the location-dependent perspective effects while leaving end-to-end training and the number of parameters of the underlying neural network unaffected.
PCL offers an easy way to improve the accuracy of existing 3D reconstruction networks by making them geometry aware.
arXiv Detail & Related papers (2020-11-27T08:48:43Z)
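
The "Transformers in Unsupervised Structure-from-Motion" entry above describes predicting pixel-wise depth, ego-motion, and camera intrinsics simultaneously. The sketch below shows one way such a joint prediction could be organized around a shared encoder; the stand-in encoder, head sizes, and output parameterizations are illustrative assumptions only, not that paper's architecture (which builds on transformer backbones and, typically, separate depth and pose networks).

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSfMNet(nn.Module):
    """Toy joint model: a shared encoder with separate heads for inverse depth,
    ego-motion (axis-angle rotation + translation), and pinhole intrinsics."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Stand-in patch encoder; a vision transformer backbone would be used in practice.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=8, stride=8),
            nn.GELU(),
        )
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # coarse per-pixel inverse depth
        self.pose_head = nn.Linear(feat_dim, 6)                  # 3 rotation + 3 translation
        self.intrinsics_head = nn.Linear(feat_dim, 4)            # fx, fy, cx, cy (normalized)

    def forward(self, image):
        feats = self.encoder(image)                      # (B, C, H/8, W/8)
        inv_depth = torch.sigmoid(self.depth_head(feats))
        pooled = feats.mean(dim=(2, 3))                  # global image descriptor
        pose = 0.01 * self.pose_head(pooled)             # keep the initial motion estimate small
        cam = self.intrinsics_head(pooled)
        focal = F.softplus(cam[:, :2])                   # focal lengths must stay positive
        center = torch.sigmoid(cam[:, 2:])               # principal point stays inside the image
        return inv_depth, pose, focal, center

# Example: one forward pass on a dummy KITTI-sized batch.
net = JointSfMNet()
inv_depth, pose, focal, center = net(torch.randn(2, 3, 192, 640))

The predicted focal lengths and principal point (here in normalized image coordinates) can be scaled by the image size and assembled into K with a helper like intrinsics_from_prediction above, connecting back to the photometric objective.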