Self-Supervised Learning of Depth and Ego-Motion from Video by
Alternative Training and Geometric Constraints from 3D to 2D
- URL: http://arxiv.org/abs/2108.01980v1
- Date: Wed, 4 Aug 2021 11:40:53 GMT
- Title: Self-Supervised Learning of Depth and Ego-Motion from Video by
Alternative Training and Geometric Constraints from 3D to 2D
- Authors: Jiaojiao Fang, Guizhong Liu
- Abstract summary: Self-supervised learning of depth and ego-motion from unlabeled monocular video has achieved promising results.
In this paper, we aim to improve depth-pose learning performance without auxiliary tasks.
We design a log-scale 3D structural consistency loss to put more emphasis on the smaller depth values during training.
- Score: 5.481942307939029
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Self-supervised learning of depth and ego-motion from unlabeled monocular
video has achieved promising results and drawn extensive attention. Most
existing methods jointly train the depth and pose networks by the photometric
consistency of adjacent frames based on the principle of structure-from-motion
(SFM). However, the coupling relationship of the depth and pose networks
seriously influences the learning performance, and the re-projection relation
is sensitive to scale ambiguity, especially for pose learning. In this paper,
we aim to improve the depth-pose learning performance without auxiliary
tasks, and we address the above issues by alternately training each task and by
incorporating epipolar geometric constraints into the Iterative Closest
Point (ICP) based point cloud matching process. Distinct from jointly training
the depth and pose networks, our key idea is to better exploit the mutual
dependency of these two tasks by alternately training each network with its
respective losses while fixing the other. We also design a log-scale 3D
structural consistency loss that puts more emphasis on the smaller depth values
during training. To make the optimization easier, we further incorporate the
epipolar geometry into the ICP-based learning process for pose learning.
Extensive experiments on various benchmark datasets indicate the superiority
of our algorithm over state-of-the-art self-supervised methods.
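The abstract only sketches the alternating schedule, so here is a minimal PyTorch-style illustration of training each network with its own loss while the other is frozen. The tiny stand-in modules and placeholder objectives are illustrative assumptions, not the paper's actual architecture or losses.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the depth and pose networks; the paper uses CNN
# encoder-decoders, these placeholders only demonstrate the schedule.
depth_net = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Softplus())
pose_net = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(16, 6))  # 6-DoF relative pose

depth_opt = torch.optim.Adam(depth_net.parameters(), lr=1e-4)
pose_opt = torch.optim.Adam(pose_net.parameters(), lr=1e-4)

def depth_objective(depth, pose):
    # Placeholder for the depth-phase losses (photometric consistency,
    # smoothness, log-scale 3D consistency); keeps the sketch runnable.
    return depth.mean()

def pose_objective(depth, pose):
    # Placeholder for the pose-phase losses (ICP-style 3D matching with
    # epipolar constraints); keeps the sketch runnable.
    return pose.pow(2).mean()

tgt = torch.rand(2, 3, 64, 64)  # target frame (dummy batch)
src = torch.rand(2, 3, 64, 64)  # adjacent source frame

for step in range(10):
    # Phase 1: update only the depth network; pose predictions are detached.
    with torch.no_grad():
        pose = pose_net(torch.cat([tgt, src], dim=1))
    depth = depth_net(tgt)
    loss_d = depth_objective(depth, pose)
    depth_opt.zero_grad()
    loss_d.backward()
    depth_opt.step()

    # Phase 2: update only the pose network; depth predictions are detached.
    with torch.no_grad():
        depth = depth_net(tgt)
    pose = pose_net(torch.cat([tgt, src], dim=1))
    loss_p = pose_objective(depth, pose)
    pose_opt.zero_grad()
    loss_p.backward()
    pose_opt.step()
```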
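The log-scale 3D structural consistency loss is described only at a high level in the abstract. One plausible reading (the exact weighting in the paper may differ) is to compare aligned depths in log space, so that errors at small depths carry relatively more weight than under a linear L1:

```python
import torch

def log_scale_consistency(d_pred, d_ref, eps=1e-6):
    """Hedged sketch: L1 distance between log-depths. Because
    log(a) - log(b) = log(a / b), this penalizes *relative* error,
    emphasizing nearby (small-depth) structure compared with a
    plain L1 on metric depth."""
    return (torch.log(d_pred + eps) - torch.log(d_ref + eps)).abs().mean()

# A 10 cm error at 1 m depth costs far more than the same error at 50 m:
near = log_scale_consistency(torch.tensor([1.1]), torch.tensor([1.0]))
far = log_scale_consistency(torch.tensor([50.1]), torch.tensor([50.0]))
print(float(near), float(far))  # ~0.0953 vs ~0.0020
```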
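The abstract folds epipolar geometry into the ICP-based pose learning; the exact formulation is in the paper. As background, the underlying two-view constraint is x2^T E x1 = 0 with E = [t]x R, and a residual on that constraint can be written as follows (a sketch; the matched points and the pose are assumed inputs):

```python
import torch

def skew(t):
    """3x3 skew-symmetric matrix [t]_x, built with stack so gradients
    can flow back into a predicted translation."""
    z = torch.zeros((), dtype=t.dtype, device=t.device)
    return torch.stack([
        torch.stack([z, -t[2], t[1]]),
        torch.stack([t[2], z, -t[0]]),
        torch.stack([-t[1], t[0], z]),
    ])

def epipolar_residual(x1, x2, R, t):
    """Mean absolute epipolar error |x2^T E x1| with E = [t]_x R.
    x1, x2: (N, 3) matched points in normalized homogeneous image
    coordinates (pixels premultiplied by K^{-1})."""
    E = skew(t) @ R
    return torch.einsum('ni,ij,nj->n', x2, E, x1).abs().mean()
```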
Related papers
- Improving Video Violence Recognition with Human Interaction Learning on 3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
arXiv Detail & Related papers (2023-08-26T12:55:18Z)
- Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning [63.63516124646916]
We propose a deeply unified framework for depth-aware panoptic segmentation.
We propose a bi-directional guidance learning approach to facilitate cross-task feature learning.
Our method sets the new state of the art for depth-aware panoptic segmentation on both Cityscapes-DVPS and SemKITTI-DVPS datasets.
arXiv Detail & Related papers (2023-07-27T11:28:33Z)
- Collaborative Learning for Hand and Object Reconstruction with Attention-guided Graph Convolution [49.10497573378427]
Estimating the pose and shape of hands and objects under interaction finds numerous applications including augmented and virtual reality.
Our optimisation is agnostic to object models, and it learns the physical rules governing hand-object interaction.
Experiments on four widely used benchmarks show that our framework surpasses state-of-the-art accuracy in 3D pose estimation and recovers dense 3D hand and object shapes.
arXiv Detail & Related papers (2022-04-27T17:00:54Z)
- Unsupervised Joint Learning of Depth, Optical Flow, Ego-motion from Video [9.94001125780824]
Estimating geometric elements such as depth, camera motion, and optical flow from images is an important part of a robot's visual perception.
We use a joint self-supervised method to estimate the three geometric elements.
arXiv Detail & Related papers (2021-05-30T12:39:48Z)
- Deep Two-View Structure-from-Motion Revisited [83.93809929963969]
Two-view structure-from-motion (SfM) is the cornerstone of 3D reconstruction and visual SLAM.
We propose to revisit the problem of deep two-view SfM by leveraging the well-posedness of the classic pipeline.
Our method consists of 1) an optical flow estimation network that predicts dense correspondences between two frames; 2) a normalized pose estimation module that computes relative camera poses from the 2D optical flow correspondences; and 3) a scale-invariant depth estimation network that leverages epipolar geometry to reduce the search space, refine the dense correspondences, and estimate relative depth maps. A minimal sketch of the classical pose-recovery step underlying 2) appears after this list.
arXiv Detail & Related papers (2021-04-01T15:31:20Z)
- SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular Images [94.36401543589523]
We introduce the concept of semantic objectness to exploit the geometric relationship of these two tasks.
We then propose a Semantic Object and Depth Estimation Network (SOSD-Net) based on the objectness assumption.
To the best of our knowledge, SOSD-Net is the first network that exploits the geometry constraint for simultaneous monocular depth estimation and semantic segmentation.
arXiv Detail & Related papers (2021-01-19T02:41:03Z)
- Monocular 3D Object Detection with Sequential Feature Association and Depth Hint Augmentation [12.55603878441083]
FADNet is presented to address the task of monocular 3D object detection.
A dedicated depth hint module is designed to generate row-wise features termed depth hints.
The contributions of this work are validated through experiments and an ablation study on the KITTI benchmark.
arXiv Detail & Related papers (2020-11-30T07:19:14Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Towards Better Generalization: Joint Depth-Pose Learning without PoseNet [36.414471128890284]
We tackle the essential problem of scale inconsistency for self-supervised joint depth-pose learning.
Most existing methods assume that a consistent scale of depth and pose can be learned across all input samples.
We propose a novel system that explicitly disentangles scale from the network estimation.
arXiv Detail & Related papers (2020-04-03T00:28:09Z)
- Semantically-Guided Representation Learning for Self-Supervised Monocular Depth [40.49380547487908]
We propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning.
Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories.
arXiv Detail & Related papers (2020-02-27T18:40:10Z)
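As referenced in the Deep Two-View Structure-from-Motion Revisited entry above, the classical well-posed step such pipelines build on is recovering a relative pose from 2D correspondences. A minimal sketch with OpenCV follows; the synthetic points, intrinsics, and ground-truth motion are all assumptions made only so the example runs end to end.

```python
import cv2
import numpy as np

# Synthetic two-view setup: random 3D points seen by two cameras.
rng = np.random.default_rng(0)
K = np.array([[718.856, 0.0, 607.193],
              [0.0, 718.856, 185.216],
              [0.0, 0.0, 1.0]])
X = rng.uniform([-5, -2, 8], [5, 2, 20], size=(200, 3))   # world points
R_gt, _ = cv2.Rodrigues(np.array([0.02, -0.05, 0.01]))    # small rotation
t_gt = np.array([[0.5], [0.0], [0.1]])                    # baseline

def project(X, R, t):
    """Pinhole projection of world points into a camera with pose (R, t)."""
    x = (K @ (R @ X.T + t)).T
    return x[:, :2] / x[:, 2:]

pts1 = project(X, np.eye(3), np.zeros((3, 1)))
pts2 = project(X, R_gt, t_gt)

# Estimate the essential matrix with RANSAC, then decompose it into a
# rotation and a unit-norm translation (scale is unobservable from two views).
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
print(np.allclose(R, R_gt, atol=1e-3), t.ravel())  # t parallels t_gt up to scale
```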
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.