Learning Optical Flow, Depth, and Scene Flow without Real-World Labels
- URL: http://arxiv.org/abs/2203.15089v1
- Date: Mon, 28 Mar 2022 20:52:12 GMT
- Title: Learning Optical Flow, Depth, and Scene Flow without Real-World Labels
- Authors: Vitor Guizilini, Kuan-Hui Lee, Rares Ambrus, Adrien Gaidon
- Abstract summary: Self-supervised monocular depth estimation enables robots to learn 3D perception from raw video streams.
We propose DRAFT, a new method capable of jointly learning depth, optical flow, and scene flow.
- Score: 33.586124995327225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised monocular depth estimation enables robots to learn 3D
perception from raw video streams. This scalable approach leverages projective
geometry and ego-motion to learn via view synthesis, assuming the world is
mostly static. Dynamic scenes, which are common in autonomous driving and
human-robot interaction, violate this assumption. Therefore, they require
modeling dynamic objects explicitly, for instance via estimating pixel-wise 3D
motion, i.e. scene flow. However, the simultaneous self-supervised learning of
depth and scene flow is ill-posed, as there are infinitely many combinations
that result in the same 3D point. In this paper we propose DRAFT, a new method
capable of jointly learning depth, optical flow, and scene flow by combining
synthetic data with geometric self-supervision. Building upon the RAFT
architecture, we learn optical flow as an intermediate task to bootstrap depth
and scene flow learning via triangulation. Our algorithm also leverages
temporal and geometric consistency losses across tasks to improve multi-task
learning. Our DRAFT architecture simultaneously establishes a new state of the
art in all three tasks in the self-supervised monocular setting on the standard
KITTI benchmark. Project page: https://sites.google.com/tri.global/draft.
Related papers
- Incorporating dense metric depth into neural 3D representations for view synthesis and relighting [25.028859317188395]
In robotic applications, dense metric depth can often be measured directly using stereo, and illumination can be controlled.
In this work we demonstrate a method to incorporate dense metric depth into the training of neural 3D representations.
We also discuss a multi-flash stereo camera system developed to capture the necessary data for our pipeline and show results on relighting and view synthesis.
arXiv Detail & Related papers (2024-09-04T20:21:13Z)
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- Mono-hydra: Real-time 3D scene graph construction from monocular camera input with IMU [0.0]
The ability of robots to autonomously navigate through 3D environments depends on their comprehension of spatial concepts.
3D scene graphs have emerged as a robust tool for representing the environment as a layered graph of concepts and their relationships.
This paper puts forth Mono-Hydra, a real-time spatial perception system combining a monocular camera and an IMU, focusing on indoor scenarios.
arXiv Detail & Related papers (2023-08-10T11:58:38Z)
- Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments [66.83839051693695]
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment.
We propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially life-long dynamic scenes with photo-realistic appearance.
A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives.
arXiv Detail & Related papers (2021-09-16T10:37:21Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Occlusion Guided Self-supervised Scene Flow Estimation on 3D Point Clouds [4.518012967046983]
Understanding the 3D flow of sparsely sampled points between two consecutive time frames is a cornerstone of modern geometry-driven systems.
This work presents a new self-supervised training method and an architecture for the 3D scene flow estimation under occlusions.
arXiv Detail & Related papers (2021-04-10T09:55:19Z)
- Weakly Supervised Learning of Rigid 3D Scene Flow [81.37165332656612]
We propose a data-driven scene flow estimation algorithm exploiting the observation that many 3D scenes can be explained by a collection of agents moving as rigid bodies.
We showcase the effectiveness and generalization capacity of our method on four different autonomous driving datasets.
arXiv Detail & Related papers (2021-02-17T18:58:02Z)
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
- Distilled Semantics for Comprehensive Scene Understanding from Videos [53.49501208503774]
In this paper, we take an additional step toward holistic scene understanding with monocular cameras by learning depth and motion alongside semantics.
We address the three tasks jointly by a novel training protocol based on knowledge distillation and self-supervision.
We show that it yields state-of-the-art results for monocular depth estimation, optical flow and motion segmentation.
arXiv Detail & Related papers (2020-03-31T08:52:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.