$S^3$Net: Semantic-Aware Self-supervised Depth Estimation with Monocular
Videos and Synthetic Data
- URL: http://arxiv.org/abs/2007.14511v1
- Date: Tue, 28 Jul 2020 22:40:54 GMT
- Title: $S^3$Net: Semantic-Aware Self-supervised Depth Estimation with Monocular
Videos and Synthetic Data
- Authors: Bin Cheng, Inderjot Singh Saggu, Raunak Shah, Gaurav Bansal, Dinesh
Bharadia
- Abstract summary: $S^3$Net is a self-supervised framework which combines synthetic and real-world images for training.
We present a unique way to train this self-supervised framework, achieving more than a $15\%$ improvement over previous synthetic supervised approaches.
- Score: 11.489124536853172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Solving depth estimation with monocular cameras enables the possibility of
widespread use of cameras as low-cost depth estimation sensors in applications
such as autonomous driving and robotics. However, learning such a scalable
depth estimation model would require a lot of labeled data which is expensive
to collect. There are two popular existing approaches which do not require
annotated depth maps: (i) using labeled synthetic and unlabeled real data in an
adversarial framework to predict more accurate depth, and (ii) unsupervised
models which exploit geometric structure across space and time in monocular
video frames. Ideally, we would like to leverage features provided by both
approaches as they complement each other; however, existing methods do not
adequately exploit these additive benefits. We present $S^3$Net, a
self-supervised framework which combines these complementary features: we use
synthetic and real-world images for training while exploiting geometric,
temporal, as well as semantic constraints. Our novel consolidated architecture
provides a new state-of-the-art in self-supervised depth estimation using
monocular videos. We present a unique way to train this self-supervised
framework, and achieve (i) more than $15\%$ improvement over previous synthetic
supervised approaches that use domain adaptation and (ii) more than $10\%$
improvement over previous self-supervised approaches which exploit geometric
constraints from the real data.
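As background for the geometric and temporal constraints described in the abstract, the following is a minimal PyTorch sketch of the photometric reconstruction loss that self-supervised monocular depth methods of this kind typically minimize: a target frame is compared against a neighboring frame warped into its view using the predicted depth and relative pose. This is not the authors' code; the 3x3 SSIM window, the SSIM/L1 mix, and the 0.85 weight are common defaults in this literature, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def ssim_dissimilarity(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-pixel SSIM dissimilarity over 3x3 windows, clamped to [0, 1]."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return ((1 - num / den) / 2).clamp(0, 1)

def photometric_loss(target: torch.Tensor, warped: torch.Tensor,
                     alpha: float = 0.85) -> torch.Tensor:
    """Standard SSIM/L1 mix; `warped` is the neighbor frame resampled
    into the target view via the predicted depth and camera motion."""
    l1 = (target - warped).abs()
    return (alpha * ssim_dissimilarity(target, warped) + (1 - alpha) * l1).mean()

# Toy usage: two random "frames" of shape (batch, channels, H, W).
target = torch.rand(1, 3, 64, 64)
warped = torch.rand(1, 3, 64, 64)
print(photometric_loss(target, warped).item())
```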
Related papers
- TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs [5.6168844664788855]
This work presents TanDepth, a practical, online scale recovery method for obtaining metric depth results from relative estimations at inference-time.
Tailored for Unmanned Aerial Vehicle (UAV) applications, our method leverages sparse measurements from Global Digital Elevation Models (GDEM) by projecting them to the camera view.
An adaptation of the Cloth Simulation Filter is presented, which selects ground points from the estimated depth map to correlate with the projected reference points (a minimal sketch of this scale-recovery idea follows the related-papers list below).
arXiv Detail & Related papers (2024-09-08T15:54:43Z)
- S$^2$Contact: Graph-based Network for 3D Hand-Object Contact Estimation with Semi-Supervised Learning [70.72037296392642]
We propose a novel semi-supervised framework that allows us to learn contact from monocular images.
Specifically, we leverage visual and geometric consistency constraints in large-scale datasets for generating pseudo-labels.
We show benefits from using a contact map that constrains hand-object interactions to produce more accurate reconstructions.
arXiv Detail & Related papers (2022-08-01T14:05:23Z)
- Multi-Frame Self-Supervised Depth with Transformers [33.00363651105475]
We propose a novel transformer architecture for cost volume generation.
We use depth-discretized epipolar sampling to select matching candidates.
We refine predictions through a series of self- and cross-attention layers.
arXiv Detail & Related papers (2022-04-15T19:04:57Z)
- SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z)
- Occlusion-Aware Self-Supervised Monocular 6D Object Pose Estimation [88.8963330073454]
We propose a novel monocular 6D pose estimation approach by means of self-supervised learning.
We leverage current trends in noisy student training and differentiable rendering to further self-supervise the model.
Our proposed self-supervision outperforms all other methods relying on synthetic data.
arXiv Detail & Related papers (2022-03-19T15:12:06Z)
- Probabilistic and Geometric Depth: Detecting Objects in Perspective [78.00922683083776]
3D object detection is an important capability needed in various practical applications such as driver assistance systems.
Monocular 3D detection, as an economical solution compared to conventional settings relying on binocular vision or LiDAR, has drawn increasing attention recently but still yields unsatisfactory results.
This paper first presents a systematic study on this problem and observes that the current monocular 3D detection problem can be simplified as an instance depth estimation problem.
arXiv Detail & Related papers (2021-07-29T16:30:33Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving [37.50089104051591]
State-of-the-art self-supervised learning approaches for monocular depth estimation usually suffer from scale ambiguity.
This paper introduces a novel multi-task learning strategy to improve self-supervised monocular distance estimation on fisheye and pinhole camera images.
arXiv Detail & Related papers (2020-08-10T10:52:47Z)
- Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues [24.743099160992937]
We propose a novel self-supervised joint learning framework for depth estimation.
The proposed framework outperforms the state-of-the-art (SOTA) on the KITTI and Make3D datasets.
arXiv Detail & Related papers (2020-06-17T13:56:59Z)
- Don't Forget The Past: Recurrent Depth Estimation from Monocular Video [92.84498980104424]
We put three different types of depth estimation into a common framework.
Our method produces a time series of depth maps.
It can be applied to monocular videos only or be combined with different types of sparse depth patterns.
arXiv Detail & Related papers (2020-01-08T16:50:51Z)
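Below is the scale-recovery sketch referenced in the TanDepth entry above. It is an illustrative simplification, not the paper's method: the GDEM projection and Cloth Simulation Filter steps are omitted, the function name is hypothetical, and a robust median ratio stands in for the paper's correlation of estimated and reference depths.

```python
import numpy as np

def recover_metric_scale(relative_depth: np.ndarray,
                         ref_points: np.ndarray,
                         ref_depths: np.ndarray) -> np.ndarray:
    """Rescale a relative depth map (H, W) to metric units so it agrees
    with sparse metric `ref_depths` (N,) sampled at integer pixel
    coordinates `ref_points` (N, 2), given as (row, col)."""
    rows, cols = ref_points[:, 0], ref_points[:, 1]
    predicted = relative_depth[rows, cols]
    # Median of per-point ratios; one global scale is applied everywhere.
    scale = np.median(ref_depths / np.maximum(predicted, 1e-6))
    return scale * relative_depth

# Toy usage: 4 sparse reference depths against a random relative map.
rel = np.random.rand(120, 160) + 0.5
pts = np.array([[10, 20], [50, 80], [90, 30], [110, 150]])
metric = recover_metric_scale(rel, pts,
                              ref_depths=20.0 * rel[pts[:, 0], pts[:, 1]])
print(metric[10, 20])  # ~20x the relative value at that pixel
```

The median is used rather than a mean so that a few bad correspondences (e.g., reference points falling on non-ground pixels) do not skew the recovered scale.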
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.