Distilled Semantics for Comprehensive Scene Understanding from Videos
- URL: http://arxiv.org/abs/2003.14030v1
- Date: Tue, 31 Mar 2020 08:52:13 GMT
- Title: Distilled Semantics for Comprehensive Scene Understanding from Videos
- Authors: Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo Poggi,
Samuele Salti, Luigi Di Stefano and Stefano Mattoccia
- Abstract summary: In this paper, we take an additional step toward holistic scene understanding with monocular cameras by learning depth and motion alongside semantics.
We address the three tasks jointly by a novel training protocol based on knowledge distillation and self-supervision.
We show that it yields state-of-the-art results for monocular depth estimation, optical flow and motion segmentation.
- Score: 53.49501208503774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A complete understanding of the surroundings is paramount to autonomous systems.
Recent works have shown that deep neural networks can learn geometry (depth)
and motion (optical flow) from a monocular video without any explicit
supervision from ground truth annotations, particularly hard to source for
these two tasks. In this paper, we take an additional step toward holistic
scene understanding with monocular cameras by learning depth and motion
alongside semantics, with supervision for the latter provided by a
pre-trained network distilling proxy ground truth images. We address the three
tasks jointly by a) a novel training protocol based on knowledge distillation
and self-supervision and b) a compact network architecture which enables
efficient scene understanding on both power-hungry GPUs and low-power embedded
platforms. We thoroughly assess the performance of our framework and show that
it yields state-of-the-art results for monocular depth estimation, optical flow
and motion segmentation.
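
The protocol described above combines two sources of supervision: view-synthesis self-supervision for depth and optical flow, and proxy semantic labels distilled from a frozen, pre-trained segmentation network. The PyTorch-style sketch below illustrates only that combination; the network interface, the warping helpers, and the loss weighting are assumptions made for the example, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def photometric_loss(target, reconstructed):
    # Simple L1 reconstruction term; full pipelines typically blend L1 with SSIM.
    return (target - reconstructed).abs().mean()

def training_step(student, teacher, warp_with_depth_pose, warp_with_flow,
                  frame_t, frame_s, optimizer, sem_weight=0.1):
    # A single compact student predicts depth, optical flow and semantic logits
    # for the target frame (an assumed interface, not the paper's API).
    depth, flow, sem_logits = student(frame_t)

    # Self-supervision: synthesize the target frame from the source frame,
    # once via rigid geometry (depth + relative camera pose) and once via flow.
    recon_rigid = warp_with_depth_pose(frame_s, depth)
    recon_flow = warp_with_flow(frame_s, flow)
    loss_photo = photometric_loss(frame_t, recon_rigid) \
               + photometric_loss(frame_t, recon_flow)

    # Knowledge distillation: proxy semantic labels come from a frozen,
    # pre-trained teacher network instead of human annotations.
    with torch.no_grad():
        proxy_labels = teacher(frame_t).argmax(dim=1)
    loss_sem = F.cross_entropy(sem_logits, proxy_labels)

    loss = loss_photo + sem_weight * loss_sem  # illustrative weighting only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```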
Related papers
- Learning Optical Flow, Depth, and Scene Flow without Real-World Labels [33.586124995327225]
Self-supervised monocular depth estimation enables robots to learn 3D perception from raw video streams.
We propose DRAFT, a new method capable of jointly learning depth, optical flow, and scene flow.
arXiv Detail & Related papers (2022-03-28T20:52:12Z) - SelfTune: Metrically Scaled Monocular Depth Estimation through
Self-Supervised Learning [53.78813049373321]
We propose a self-supervised learning method for pre-trained supervised monocular depth networks to enable metrically scaled depth estimation.
Our approach is useful for various applications such as mobile robot navigation and is applicable to diverse environments.
arXiv Detail & Related papers (2022-03-10T12:28:42Z) - A Deeper Look into DeepCap [96.67706102518238]
We propose a novel deep learning approach for monocular dense human performance capture.
Our method is trained in a weakly supervised manner based on multi-view supervision.
Our approach outperforms the state of the art in terms of quality and robustness.
arXiv Detail & Related papers (2021-11-20T11:34:33Z) - X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task
Distillation [69.9604394044652]
We propose a novel method to improve the self-supervised training of monocular depth via cross-task knowledge distillation.
During training, we utilize a pretrained semantic segmentation teacher network and transfer its semantic knowledge to the depth network.
We extensively evaluate the efficacy of our proposed approach on the KITTI benchmark and compare it with the latest state of the art.
arXiv Detail & Related papers (2021-10-24T19:47:14Z) - MaAST: Map Attention with Semantic Transformersfor Efficient Visual
Navigation [4.127128889779478]
This work aims to match or outperform existing learning-based solutions for visual navigation by autonomous agents.
We propose a method to encode vital scene semantics into a semantically informed, top-down egocentric map representation.
We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-03-21T12:01:23Z) - DeepCap: Monocular Human Performance Capture Using Weak Supervision [106.50649929342576]
We propose a novel deep learning approach for monocular dense human performance capture.
Our method is trained in a weakly supervised manner based on multi-view supervision.
Our approach outperforms the state of the art in terms of quality and robustness.
arXiv Detail & Related papers (2020-03-18T16:39:56Z) - Learning Depth With Very Sparse Supervision [57.911425589947314]
This paper explores the idea that perception gets coupled to 3D properties of the world via interaction with the environment.
We train a specialized global-local network architecture with what would be available to a robot interacting with the environment.
Experiments on several datasets show that, when ground truth is available even for just one of the image pixels, the proposed network can learn monocular dense depth estimation up to 22.5% more accurately than state-of-the-art approaches.
arXiv Detail & Related papers (2020-03-02T10:44:13Z) - Semantically-Guided Representation Learning for Self-Supervised
Monocular Depth [40.49380547487908]
We propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning.
Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories.
arXiv Detail & Related papers (2020-02-27T18:40:10Z)