Transformers in Self-Supervised Monocular Depth Estimation with Unknown
Camera Intrinsics
- URL: http://arxiv.org/abs/2202.03131v1
- Date: Mon, 7 Feb 2022 13:17:29 GMT
- Title: Transformers in Self-Supervised Monocular Depth Estimation with Unknown
Camera Intrinsics
- Authors: Arnav Varma, Hemang Chawla, Bahram Zonooz and Elahe Arani
- Abstract summary: Self-supervised monocular depth estimation is an important task in 3D scene understanding.
We show how to adapt vision transformers for self-supervised monocular depth estimation.
Our study demonstrates how a transformer-based architecture achieves comparable performance while being more robust and generalizable.
- Score: 13.7258515433446
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The advent of autonomous driving and advanced driver assistance systems
necessitates continuous developments in computer vision for 3D scene
understanding. Self-supervised monocular depth estimation, a method for
pixel-wise distance estimation of objects from a single camera without the use
of ground truth labels, is an important task in 3D scene understanding.
However, existing methods for this task are limited to convolutional neural
network (CNN) architectures. In contrast with CNNs that use localized linear
operations and lose feature resolution across the layers, vision transformers
process at constant resolution with a global receptive field at every stage.
While recent works have compared transformers against their CNN counterparts
for tasks such as image classification, no study exists that investigates the
impact of using transformers for self-supervised monocular depth estimation.
Here, we first demonstrate how to adapt vision transformers for self-supervised
monocular depth estimation. Thereafter, we compare the transformer and
CNN-based architectures for their performance on KITTI depth prediction
benchmarks, as well as their robustness to natural corruptions and adversarial
attacks, including when the camera intrinsics are unknown. Our study
demonstrates how a transformer-based architecture, though lower in run-time
efficiency, achieves comparable performance while being more robust and
generalizable.
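
For context, the training signal behind this family of methods is a photometric reprojection objective: predicted depth and relative camera pose are used to warp a neighboring video frame into the current view, and the reconstruction error supervises the networks without any ground-truth depth. The sketch below (PyTorch) is a minimal illustration of that objective, including assembling the intrinsics matrix from a predicted focal length and principal point for the unknown-intrinsics setting; the function names, the L1-only error (no SSIM, smoothness, or multi-scale terms), and all shapes are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def intrinsics_from_prediction(focal, center):
    """Assemble a pinhole intrinsics matrix K from predicted focal lengths (fx, fy)
    and principal point (cx, cy); this mirrors the unknown-intrinsics setting."""
    b = focal.shape[0]
    K = torch.zeros(b, 3, 3, dtype=focal.dtype, device=focal.device)
    K[:, 0, 0], K[:, 1, 1] = focal[:, 0], focal[:, 1]
    K[:, 0, 2], K[:, 1, 2] = center[:, 0], center[:, 1]
    K[:, 2, 2] = 1.0
    return K

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using predicted depth and inverse intrinsics."""
    b, _, h, w = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    return (K_inv @ pix) * depth.reshape(b, 1, -1)            # (B, 3, H*W)

def photometric_loss(target, source, depth, pose, K):
    """Warp the source frame into the target view via depth and pose, then compare."""
    b, _, h, w = target.shape
    points = backproject(depth, torch.linalg.inv(K))          # pixels -> target-camera 3D points
    ones = torch.ones(b, 1, points.shape[-1], dtype=points.dtype, device=points.device)
    cam = (pose @ torch.cat([points, ones], dim=1))[:, :3]    # move points into the source camera
    uv = K @ cam                                              # project with the same intrinsics
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)               # perspective division
    u = uv[:, 0].reshape(b, h, w) / (w - 1) * 2 - 1           # normalize to [-1, 1] for grid_sample
    v = uv[:, 1].reshape(b, h, w) / (h - 1) * 2 - 1
    warped = F.grid_sample(source, torch.stack([u, v], dim=-1),
                           padding_mode="border", align_corners=True)
    return (target - warped).abs().mean()                     # L1 photometric error

In a full pipeline, depth would come from the depth network, pose from the pose network, and, when intrinsics are unknown, K from intrinsics_from_prediction applied to additional network outputs; minimizing this loss over consecutive frames trains everything jointly.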
Related papers
- Explainable Multi-Camera 3D Object Detection with Transformer-Based
Saliency Maps [0.0]
Vision Transformers (ViTs) have achieved state-of-the-art results on various computer vision tasks, including 3D object detection.
End-to-end implementation makes ViTs less explainable, which can be a challenge for deploying them in safety-critical applications.
We propose a novel method for generating saliency maps for a DETR-like ViT with multiple camera inputs used for 3D object detection.
arXiv Detail & Related papers (2023-12-22T11:03:12Z) - Transformers in Unsupervised Structure-from-Motion [19.43053045216986]
Transformers have revolutionized deep learning based computer vision with improved performance as well as robustness to natural corruptions and adversarial attacks.
We propose a robust transformer-based monocular SfM method that learns to predict monocular pixel-wise depth, the ego vehicle's translation and rotation, as well as the camera's focal length and principal point, simultaneously (an illustrative sketch of such a joint prediction head appears after this list).
Our study shows that transformer-based architecture achieves comparable performance while being more robust against natural corruptions, as well as untargeted and targeted attacks.
arXiv Detail & Related papers (2023-12-16T20:00:34Z) - OCTraN: 3D Occupancy Convolutional Transformer Network in Unstructured
Traffic Scenarios [0.0]
We propose OCTraN, a transformer architecture that uses iterative-attention to convert 2D image features into 3D occupancy features.
We also develop a self-supervised training pipeline to generalize the model to any scene by eliminating the need for LiDAR ground truth.
arXiv Detail & Related papers (2023-07-20T15:06:44Z) - Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular
Depth Estimation [33.018300966769516]
Most state-of-the-art (SOTA) works in the self-supervised and unsupervised domain predict disparity maps from a given input image.
Our model fuses per-pixel local information learned using two fully convolutional depth encoders with global contextual information learned by a transformer encoder at different scales.
It does so using a mask-guided multi-stream convolution in the feature space to achieve state-of-the-art performance on most standard benchmarks.
arXiv Detail & Related papers (2022-11-20T20:00:21Z) - Learning Explicit Object-Centric Representations with Vision
Transformers [81.38804205212425]
We build on the self-supervision task of masked autoencoding and explore its effectiveness for learning object-centric representations with transformers.
We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
arXiv Detail & Related papers (2022-10-25T16:39:49Z) - 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z) - Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z) - Container: Context Aggregation Network [83.12004501984043]
A recent finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - PCLs: Geometry-aware Neural Reconstruction of 3D Pose with Perspective
Crop Layers [111.55817466296402]
We introduce Perspective Crop Layers (PCLs) - a form of perspective crop of the region of interest based on the camera geometry.
PCLs deterministically remove the location-dependent perspective effects while leaving end-to-end training and the number of parameters of the underlying neural network unaffected.
PCL offers an easy way to improve the accuracy of existing 3D reconstruction networks by making them geometry aware.
arXiv Detail & Related papers (2020-11-27T08:48:43Z)
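
The "Transformers in Unsupervised Structure-from-Motion" entry above describes predicting pixel-wise depth, ego-motion, and camera intrinsics simultaneously. The sketch below shows one way such a joint prediction could be organized around a shared encoder; the stand-in encoder, head sizes, and output parameterizations are illustrative assumptions only, not that paper's architecture (which builds on transformer backbones and, typically, separate depth and pose networks).

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSfMNet(nn.Module):
    """Toy joint model: a shared encoder with separate heads for inverse depth,
    ego-motion (axis-angle rotation + translation), and pinhole intrinsics."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Stand-in patch encoder; a vision transformer backbone would be used in practice.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=8, stride=8),
            nn.GELU(),
        )
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # coarse per-pixel inverse depth
        self.pose_head = nn.Linear(feat_dim, 6)                  # 3 rotation + 3 translation
        self.intrinsics_head = nn.Linear(feat_dim, 4)            # fx, fy, cx, cy (normalized)

    def forward(self, image):
        feats = self.encoder(image)                      # (B, C, H/8, W/8)
        inv_depth = torch.sigmoid(self.depth_head(feats))
        pooled = feats.mean(dim=(2, 3))                  # global image descriptor
        pose = 0.01 * self.pose_head(pooled)             # keep the initial motion estimate small
        cam = self.intrinsics_head(pooled)
        focal = F.softplus(cam[:, :2])                   # focal lengths must stay positive
        center = torch.sigmoid(cam[:, 2:])               # principal point stays inside the image
        return inv_depth, pose, focal, center

# Example: one forward pass on a dummy KITTI-sized batch.
net = JointSfMNet()
inv_depth, pose, focal, center = net(torch.randn(2, 3, 192, 640))

The predicted focal lengths and principal point (here in normalized image coordinates) can be scaled by the image size and assembled into K with a helper like intrinsics_from_prediction above, connecting back to the photometric objective.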