Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes
- URL: http://arxiv.org/abs/2105.02195v1
- Date: Wed, 5 May 2021 17:08:10 GMT
- Title: Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes
- Authors: Dan Xu, Andrea Vedaldi, Joao F. Henriques
- Abstract summary: We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different point-of-view.
By minimizing the error between the synthetic image and the corresponding real image in a video, the deep network that predicts pose and depth can be trained completely unsupervised.
- Score: 85.56602190773684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method to train deep networks to decompose videos into 3D
geometry (camera and depth), moving objects, and their motions, with no
supervision. We build on the idea of view synthesis, which uses classical
camera geometry to re-render a source image from a different point-of-view,
specified by a predicted relative pose and depth map. By minimizing the error
between the synthetic image and the corresponding real image in a video, the
deep network that predicts pose and depth can be trained completely
unsupervised. However, the view synthesis equations rely on a strong
assumption: that objects do not move. This rigid-world assumption limits the
predictive power, and rules out learning about objects automatically. We
propose a simple solution: minimize the error on small regions of the image
instead. While the scene as a whole may be non-rigid, it is always possible to
find small regions that are approximately rigid, such as inside a moving
object. Our network can then predict different poses for each region, in a
sliding window. This represents a significantly richer model, including 6D
object motions, with little additional complexity. We establish new
state-of-the-art results on unsupervised odometry and depth prediction on
KITTI. We also demonstrate new capabilities on EPIC-Kitchens, a challenging
dataset of indoor videos, where there is no ground truth information for depth,
odometry, object segmentation or motion. Yet all are recovered automatically by
our method.
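The loss described above lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration (not the authors' code) of a patch-wise view-synthesis loss in the spirit of the abstract: predicted depth lifts target pixels into 3D, a per-region 6D pose (rotation R, translation t) re-projects them into the source frame, and the photometric error is accumulated over small patches so each region can carry its own rigid motion. Function names, the patch size, and the `region_poses` argument are illustrative assumptions.

```python
# Sketch of a patch-wise view-synthesis loss; assumes a pinhole camera with
# intrinsics K, a predicted depth map for the target frame, and one predicted
# rigid motion (R, t) per image region in a sliding window.
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift each pixel of a (B, 1, H, W) depth map to a 3D point in camera space."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(1, 3, -1)  # homogeneous pixel grid
    rays = K_inv @ pix                                          # camera rays, (B, 3, H*W)
    return rays * depth.reshape(B, 1, -1)                       # scale rays by depth

def warp_source_to_target(src_img, tgt_depth, K, R, t):
    """Re-render the source image from the target viewpoint (classic view synthesis)."""
    B, _, H, W = src_img.shape
    pts = backproject(tgt_depth, torch.inverse(K))   # 3D points in the target camera frame
    pts_src = R @ pts + t                            # rigid transform into the source frame
    proj = K @ pts_src                               # pinhole projection
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # perspective divide
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0               # normalise to [-1, 1] for grid_sample
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def patchwise_photometric_loss(src_img, tgt_img, tgt_depth, K, region_poses, patch=32):
    """Average the photometric error over small patches, each warped with its own pose.

    region_poses: dict mapping a (row, col) patch index to an (R, t) pair; a stand-in
    for the per-region poses a network would predict in a sliding window.
    """
    H, W = tgt_img.shape[-2:]
    loss = 0.0
    for (i, j), (R, t) in region_poses.items():
        # Each patch gets its own rigid motion, so moving objects can also be explained.
        synth = warp_source_to_target(src_img, tgt_depth, K, R, t)
        ys, ye = i * patch, min((i + 1) * patch, H)
        xs, xe = j * patch, min((j + 1) * patch, W)
        loss = loss + (synth[..., ys:ye, xs:xe] - tgt_img[..., ys:ye, xs:xe]).abs().mean()
    return loss / max(len(region_poses), 1)
```

In practice the depth and per-region poses would come from the networks being trained, and the photometric term would typically be combined with smoothness or structural-similarity terms, which are omitted here.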
Related papers
- Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction [51.3632308129838]
We present Total-Decom, a novel method for decomposed 3D reconstruction with minimal human interaction.
Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition.
We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications, such as animation and scene editing.
arXiv Detail & Related papers (2024-03-28T11:12:33Z) - FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z) - DnD: Dense Depth Estimation in Crowded Dynamic Indoor Scenes [68.38952377590499]
We present a novel approach for estimating depth from a monocular camera as it moves through complex indoor environments.
Our approach predicts absolute-scale depth maps over the entire scene, consisting of a static background and multiple moving people.
arXiv Detail & Related papers (2021-08-12T09:12:39Z) - Back to the Feature: Learning Robust Camera Localization from Pixels to Pose [114.89389528198738]
We introduce PixLoc, a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model.
The system can localize in large environments given coarse pose priors, and can also improve the accuracy of sparse feature matching.
arXiv Detail & Related papers (2021-03-16T17:40:12Z) - Shape and Viewpoint without Keypoints [63.26977130704171]
We present a learning framework that learns to recover the 3D shape, pose and texture from a single image.
The framework is trained on an image collection without any ground-truth 3D shape, multi-view, camera-viewpoint, or keypoint supervision.
We obtain state-of-the-art camera prediction results and show that we can learn to predict diverse shapes and textures across objects.
arXiv Detail & Related papers (2020-07-21T17:58:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.