SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation
Synergized with Semantic Segmentation for Autonomous Driving
- URL: http://arxiv.org/abs/2008.04017v3
- Date: Sat, 14 Nov 2020 21:03:21 GMT
- Title: SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation
Synergized with Semantic Segmentation for Autonomous Driving
- Authors: Varun Ravi Kumar, Marvin Klingner, Senthil Yogamani, Stefan Milz, Tim
Fingscheidt and Patrick Maeder
- Abstract summary: State-of-the-art self-supervised learning approaches for monocular depth estimation usually suffer from scale ambiguity.
This paper introduces a novel multi-task learning strategy to improve self-supervised monocular distance estimation on fisheye and pinhole camera images.
- Score: 37.50089104051591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art self-supervised learning approaches for monocular depth
estimation usually suffer from scale ambiguity. They do not generalize well
when applied to distance estimation for complex projection models such as
fisheye and omnidirectional cameras. This paper introduces a novel multi-task
learning strategy to improve self-supervised monocular distance estimation on
fisheye and pinhole camera images. The contribution of this work is threefold:
Firstly, we introduce a novel distance estimation network architecture using a
self-attention-based encoder coupled with robust semantic feature guidance to
the decoder, which can be trained in a one-stage fashion. Secondly, we integrate
a generalized robust loss function, which improves performance significantly
while removing the need for hyperparameter tuning of the reprojection loss.
Finally, we reduce the artifacts caused by dynamic objects violating the
static-world assumption using a semantic masking strategy. We significantly
improve upon previous work on fisheye cameras, reducing RMSE by 25%. As there is
little work on fisheye cameras, we also evaluate the proposed method on KITTI
using a pinhole model, achieving state-of-the-art performance among
self-supervised methods without requiring an external scale estimation.
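To make the loss-side contributions concrete, the following is a minimal PyTorch sketch of a general robust loss in the style of Barron (CVPR 2019) combined with a semantic masking step that excludes potentially dynamic pixels from the reprojection loss. The shapes, class ids, fixed alpha, and function names are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def general_robust_loss(x, alpha=1.0, c=0.1, eps=1e-6):
    """Barron's general robust penalty rho(x, alpha, c).

    alpha -> 2 recovers L2, alpha = 1 approximates Charbonnier, and
    alpha -> 0 approaches a Cauchy-like penalty; c sets the scale at
    which residuals transition from quadratic to robust behavior.
    """
    a = torch.as_tensor(alpha, dtype=x.dtype, device=x.device)
    b = torch.clamp((2.0 - a).abs(), min=eps)            # |alpha - 2|
    sq = (x / c) ** 2
    return (b / a) * ((sq / b + 1.0) ** (a / 2.0) - 1.0)

# Hypothetical ids of movable classes (person, rider, vehicle, ...).
DYNAMIC_CLASS_IDS = (11, 12, 13)

def masked_reprojection_loss(target, warped, seg_labels):
    """Robust photometric residual averaged over static pixels only.

    target/warped: (B, 3, H, W) images; seg_labels: (B, H, W) class ids.
    Assumes at least one static pixel per batch.
    """
    residual = (target - warped).abs().mean(dim=1)       # (B, H, W)
    loss_map = general_robust_loss(residual)
    static = torch.ones_like(residual, dtype=torch.bool)
    for cid in DYNAMIC_CLASS_IDS:
        static &= seg_labels != cid                      # drop dynamic pixels
    return loss_map[static].mean()
```

In the adaptive variant of this loss, alpha is learned jointly with the network, which is what removes the manual tuning mentioned in the abstract.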
Related papers
- FisheyeDepth: A Real Scale Self-Supervised Depth Estimation Model for Fisheye Camera [8.502741852406904]
We present FisheyeDepth, a self-supervised depth estimation model tailored for fisheye cameras.
We incorporate a fisheye camera model into the projection and reprojection stages during training to handle image distortions.
We also incorporate real-scale pose information into the geometric projection between consecutive frames, replacing the poses estimated by the conventional pose network.
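As a hedged illustration of what folding a fisheye model into the projection stage can look like, the sketch below projects camera-frame 3D points with an equidistant model (r = f * theta) instead of a pinhole; the calibration model and interfaces used in the paper may differ.

```python
import torch

def fisheye_project(points_cam, fx, fy, cx, cy, eps=1e-8):
    """Project 3D camera-frame points (B, 3, N) to pixels under r = f * theta."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    r_xy = torch.sqrt(x * x + y * y).clamp(min=eps)      # distance off-axis
    theta = torch.atan2(r_xy, z)                         # angle from optical axis
    u = cx + fx * theta * (x / r_xy)                     # image radius grows
    v = cy + fy * theta * (y / r_xy)                     # linearly with theta
    return torch.stack([u, v], dim=1)                    # (B, 2, N)
```

During self-supervised training, this function (and its inverse unprojection) would replace the pinhole projection inside the view-synthesis step.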
arXiv Detail & Related papers (2024-09-23T14:31:42Z)
- Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers [39.14931758754381]
We introduce a novel fusion method that bypasses monocular depth estimation altogether.
We show that our model can modulate its use of camera features based on the availability of lidar features.
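A rough sketch of the attention-based lifting this summary alludes to, assuming BEV queries derived from lidar features cross-attend to camera feature tokens so that no per-pixel depth is ever predicted; module sizes and the query construction are illustrative, not the paper's exact design.

```python
import torch.nn as nn

class LidarGuidedCameraLift(nn.Module):
    """BEV queries built from lidar features cross-attend to camera tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lidar_bev, cam_feats):
        """lidar_bev: (B, N_bev, C) BEV tokens; cam_feats: (B, N_cam, C).

        Where lidar evidence is weak, attention can down-weight camera
        features, matching the modulation behavior noted above.
        """
        fused, _ = self.attn(query=lidar_bev, key=cam_feats, value=cam_feats)
        return lidar_bev + fused                         # residual connection
```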
arXiv Detail & Related papers (2023-12-22T18:51:50Z)
- Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image [85.91935485902708]
We show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models.
We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models.
Our method enables the accurate recovery of metric 3D structures on randomly collected internet images.
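A hedged sketch of the canonical camera space idea: resize the input so its effective focal length matches a canonical one, predict depth there, then rescale the prediction back to the source camera's metric scale. The image-resizing variant shown is one way to realize the transform; the constant and function names are assumptions.

```python
import torch.nn.functional as F

CANONICAL_FOCAL = 1000.0  # assumed canonical focal length in pixels

def to_canonical(image, focal):
    """Resize a (B, C, H, W) image so its effective focal length is canonical."""
    s = CANONICAL_FOCAL / focal
    return F.interpolate(image, scale_factor=s, mode="bilinear",
                         align_corners=False)

def de_canonical_depth(depth_canonical, focal):
    """Undo the transform on predicted depth: d = d_c * f / f_c."""
    return depth_canonical * focal / CANONICAL_FOCAL
```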
arXiv Detail & Related papers (2023-07-20T16:14:23Z)
- FG-Depth: Flow-Guided Unsupervised Monocular Depth Estimation [17.572459787107427]
We propose a flow distillation loss to replace the typical photometric loss and a prior flow based mask to remove invalid pixels.
Our approach achieves state-of-the-art results on both KITTI and NYU-Depth-v2 datasets.
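A minimal sketch of a flow distillation objective in this spirit: supervise the rigid flow induced by predicted depth and pose with a pretrained optical-flow teacher, masked by validity derived from the prior flow (e.g., a forward-backward consistency check). Names and the masking criterion are assumptions.

```python
import torch

def flow_distillation_loss(rigid_flow, teacher_flow, valid_mask):
    """rigid_flow/teacher_flow: (B, 2, H, W); valid_mask: (B, 1, H, W) in {0, 1}.

    L1 distillation between the flow implied by depth + pose and the
    teacher's flow, averaged over pixels the prior flow marks valid.
    """
    err = (rigid_flow - teacher_flow).abs().sum(dim=1, keepdim=True)
    return (err * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)
```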
arXiv Detail & Related papers (2023-01-20T04:02:13Z)
- CbwLoss: Constrained Bidirectional Weighted Loss for Self-supervised Learning of Depth and Pose [13.581694284209885]
Photometric differences are used to train neural networks for estimating depth and camera pose from unlabeled monocular videos.
In this paper, we deal with moving objects and occlusions utilizing the difference of the flow fields and depth structure generated by affine transformation and view synthesis.
We mitigate the effect of textureless regions on model optimization by measuring differences between features with more semantic and contextual information without adding networks.
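One plausible reading of "measuring differences between features ... without adding networks" is a feature-level residual computed on encoder activations the model already produces, which stays informative in textureless regions; the sketch below is a hedged stand-in, not the paper's exact formulation.

```python
def feature_difference_loss(feat_target, feat_warped, weight_map=None):
    """L1 difference between encoder features of real and synthesized views.

    feat_*: (B, C, H, W) torch feature maps; weight_map: optional
    (B, 1, H, W) weights, e.g. to down-weight occluded pixels.
    """
    diff = (feat_target - feat_warped).abs().mean(dim=1, keepdim=True)
    if weight_map is not None:
        diff = diff * weight_map
    return diff.mean()
```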
arXiv Detail & Related papers (2022-12-12T12:18:24Z)
- SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets.
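A minimal sketch of cross-view fusion in this spirit: flatten per-camera feature maps into tokens and let multi-head self-attention exchange information across all views. Full attention over all tokens is shown for clarity; the paper's transformer is likely more efficient, and all dimensions here are assumptions.

```python
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Self-attention over tokens pooled from all surrounding cameras."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        """feats: (B, V, C, H, W) features from V surrounding cameras."""
        b, v, c, h, w = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2).reshape(b, v * h * w, c)
        fused, _ = self.attn(tokens, tokens, tokens)     # mix across views
        tokens = self.norm(tokens + fused)               # residual + norm
        return (tokens.reshape(b, v, h * w, c)
                      .permute(0, 1, 3, 2)
                      .reshape(b, v, c, h, w))
```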
arXiv Detail & Related papers (2022-04-07T17:58:47Z)
- SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning [53.78813049373321]
We propose a self-supervised learning method for the pre-trained supervised monocular depth networks to enable metrically scaled depth estimation.
Our approach is useful for various applications such as mobile robot navigation and is applicable to diverse environments.
arXiv Detail & Related papers (2022-03-10T12:28:42Z)
- Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735]
We introduce a multi-frame monocular scene flow network based on self-supervised learning.
We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
arXiv Detail & Related papers (2021-05-05T17:49:55Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- $S^3$Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data [11.489124536853172]
$S^3$Net is a self-supervised framework which combines synthetic and real-world images for training.
We present a unique way to train this self-supervised framework, achieving more than a 15% improvement over previous synthetic supervised approaches.
arXiv Detail & Related papers (2020-07-28T22:40:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.