Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object
Problem by Semantic Guidance
- URL: http://arxiv.org/abs/2007.06936v2
- Date: Tue, 21 Jul 2020 11:00:22 GMT
- Title: Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object
Problem by Semantic Guidance
- Authors: Marvin Klingner, Jan-Aike Termöhlen, Jonas Mikolajczyk, Tim Fingscheidt
- Abstract summary: Self-supervised monocular depth estimation presents a powerful method to obtain 3D scene information from single camera images.
We present a new self-supervised semantically-guided depth estimation (SGDepth) method to deal with moving dynamic-class (DC) objects.
- Score: 36.73303869405764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation presents a powerful method to
obtain 3D scene information from single camera images, which is trainable on
arbitrary image sequences without requiring depth labels, e.g., from a LiDAR
sensor. In this work we present a new self-supervised semantically-guided depth
estimation (SGDepth) method to deal with moving dynamic-class (DC) objects,
such as moving cars and pedestrians, which violate the static-world assumptions
typically made during training of such models. Specifically, we propose (i)
mutually beneficial cross-domain training of (supervised) semantic segmentation
and self-supervised depth estimation with task-specific network heads, (ii) a
semantic masking scheme providing guidance to prevent moving DC objects from
contaminating the photometric loss, and (iii) a detection method for frames
with non-moving DC objects, from which the depth of DC objects can be learned.
We demonstrate the performance of our method on several benchmarks, in
particular on the Eigen split, where we exceed all baselines without test-time
refinement.
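As a rough illustration of contributions (ii) and (iii), the sketch below shows how a dynamic-class mask derived from the segmentation head could gate the photometric loss, plus a plausible overlap test for detecting frames with non-moving DC objects. This is a minimal PyTorch sketch under assumed conventions (the class IDs, the IoU-style static-frame test, and the plain L1 error in place of the usual SSIM/L1 mix are illustrative assumptions), not the paper's exact formulation.

```python
import torch

# Hypothetical dynamic-class (DC) IDs in the segmentation label space,
# e.g., car/pedestrian/rider; the real IDs depend on the dataset.
DC_CLASS_IDS = (11, 12, 13)

def dc_mask_from_logits(seg_logits):
    """Binary mask of pixels predicted as any dynamic class."""
    seg = seg_logits.argmax(dim=1, keepdim=True)          # (B,1,H,W)
    mask = torch.zeros_like(seg, dtype=torch.bool)
    for c in DC_CLASS_IDS:
        mask |= seg == c
    return mask

def masked_photometric_loss(target, warped, dc_mask):
    """Average the photometric error over non-DC pixels only, so moving
    objects cannot contaminate the loss (L1 error here for brevity)."""
    err = (target - warped).abs().mean(dim=1, keepdim=True)  # (B,1,H,W)
    static = ~dc_mask
    return (err * static).sum() / static.sum().clamp(min=1)

def dc_objects_static(warped_dc_mask, target_dc_mask, iou_thresh=0.75):
    """Plausible frame test: if the DC mask warped by the estimated
    ego-motion overlaps well with the target frame's DC mask, the DC
    objects are likely non-moving, so this frame's DC pixels can also
    contribute to depth training."""
    inter = (warped_dc_mask & target_dc_mask).sum().float()
    union = (warped_dc_mask | target_dc_mask).sum().float().clamp(min=1)
    return (inter / union) >= iou_thresh
```

In training, the mask would be applied to the error map of each warped source frame, and frames passing the static test would additionally contribute DC pixels to the depth loss.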
Related papers
- Towards Domain Generalization for Multi-view 3D Object Detection in Bird-Eye-View [11.958753088613637]
We first analyze the causes of the domain gap for the MV3D-Det task.
To acquire a robust depth prediction, we propose to decouple the depth estimation from intrinsic parameters of the camera.
We modify the focal length values to create multiple pseudo-domains and construct an adversarial training loss to encourage the feature representation to be more domain-agnostic (a minimal sketch of the focal-length rescaling idea appears after this list).
arXiv Detail & Related papers (2023-03-03T02:59:13Z)
- 3D Object Aided Self-Supervised Monocular Depth Estimation [5.579605877061333]
We propose a new method to address dynamic object movements through monocular 3D object detection.
Specifically, we first detect 3D objects in the images and build the per-pixel correspondence of the dynamic pixels with the detected object pose.
In this way, the depth of every pixel can be learned via a meaningful geometry model (see the object-pose warping sketch after this list).
arXiv Detail & Related papers (2022-12-04T08:52:33Z)
- SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes [58.89295356901823]
Self-supervised monocular depth estimation has shown impressive results in static scenes.
Training relies on the multi-view consistency assumption, which, however, is violated in dynamic object regions.
We introduce an external pretrained monocular depth estimation model to generate a single-image depth prior.
Our model can predict sharp and accurate depth maps, even when trained on monocular videos of highly dynamic scenes.
arXiv Detail & Related papers (2022-11-07T16:17:47Z)
- Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon (a generic backprojection sketch appears after this list).
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z)
- Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation [76.58256020932312]
Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task.
We present a self-supervised learning framework for 3D object motion field estimation from monocular videos.
arXiv Detail & Related papers (2021-10-13T16:45:01Z)
- Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection [86.25022248968908]
We learn context- and depth-aware feature representations for monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z)
- M3DSSD: Monocular 3D Single Stage Object Detector [82.25793227026443]
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention.
The proposed M3DSSD achieves significantly better performance than existing monocular 3D object detection methods on the KITTI dataset.
arXiv Detail & Related papers (2021-03-24T13:09:11Z)
- IAFA: Instance-aware Feature Aggregation for 3D Object Detection from a Single Image [37.83574424518901]
3D object detection from a single image is an important task in autonomous driving.
We propose an instance-aware approach to aggregate useful information for improving the accuracy of 3D object detection.
arXiv Detail & Related papers (2021-03-05T05:47:52Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- Semantics-Driven Unsupervised Learning for Monocular Depth and Ego-Motion Estimation [33.83396613039467]
We propose a semantics-driven unsupervised learning approach for monocular depth and ego-motion estimation from videos.
Recent unsupervised learning methods employ the photometric error between a synthesized view and the actual image as a supervision signal for training.
arXiv Detail & Related papers (2020-06-08T05:55:07Z)
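For the first entry above (domain generalization for multi-view 3D detection), a minimal sketch of decoupling depth from camera intrinsics: predict depth for a canonical focal length and rescale by the actual one. F_REF and the linear rescaling rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

F_REF = 1000.0  # hypothetical canonical focal length in pixels

def metric_depth(canonical_depth, focal_length):
    """Rescale a focal-length-agnostic depth prediction to metric depth.

    For a pinhole camera, apparent size scales with f / depth, so depth
    predicted as if f == F_REF transfers to another camera by f / F_REF.

    canonical_depth: (B, 1, H, W) depth predicted under F_REF
    focal_length:    (B,) per-image focal lengths in pixels
    """
    return canonical_depth * focal_length.view(-1, 1, 1, 1) / F_REF
```

Resizing images (and hence focal lengths) then yields the pseudo-domains over which an adversarial loss can enforce domain-agnostic features.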
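For the "3D Object Aided Self-Supervised Monocular Depth Estimation" entry, a sketch of the per-pixel correspondence for dynamic pixels, assuming the detected object's rigid motion T_obj between the two frames and the camera ego-motion T_ego are given; the inputs and composition order are illustrative assumptions.

```python
import torch

def apply_rigid(points_h, T):
    """Apply a 4x4 rigid transform to homogeneous points (N, 4)."""
    return points_h @ T.T

def warp_points(points_cam, obj_mask, T_ego, T_obj):
    """Per-pixel 3D correspondence between two frames: static pixels
    follow the camera ego-motion alone, while pixels inside a detected
    object's 3D box are first moved by the object's own rigid motion,
    giving dynamic pixels a geometrically meaningful correspondence.

    points_cam: (N, 3) pixels backprojected to 3D with predicted depth
    obj_mask:   (N,) bool, True for pixels on the detected object
    T_ego, T_obj: (4, 4) rigid transforms in the source camera frame
    """
    ones = torch.ones_like(points_cam[:, :1])
    p = torch.cat([points_cam, ones], dim=1)       # homogeneous (N, 4)
    p[obj_mask] = apply_rigid(p[obj_mask], T_obj)  # object motion first
    p = apply_rigid(p, T_ego)                      # then camera motion
    return p[:, :3]                                # project with K afterwards
```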
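For the "Monocular 3D Object Detection with Depth from Motion" entry, a generic sketch of lifting 2D features into 3D by backprojecting pixels through the inverse intrinsics; this is a standard backprojection step, not necessarily DfM's exact lifting operation.

```python
import torch

def lift_features_to_3d(features, depth, K_inv):
    """Backproject each pixel to a 3D point so its feature vector can
    be aggregated in 3D space (e.g., into a voxel grid) for detection.

    features: (B, C, H, W) image features
    depth:    (B, 1, H, W) per-pixel depth (from geometry or a network)
    K_inv:    (3, 3) inverse camera intrinsics
    """
    B, C, H, W = features.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)])      # (3, H, W)
    rays = (K_inv @ pix.reshape(3, -1)).reshape(3, H, W)  # unit-depth rays
    points = rays.unsqueeze(0) * depth                    # (B, 3, H, W)
    return points, features                               # pair for 3D pooling
```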
This list is automatically generated from the titles and abstracts of the papers on this site.