Fugu-MT paper translation (abstract): Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes

Paper abstract: Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes

arxiv url: http://arxiv.org/abs/2105.02195v1
Date: Wed, 5 May 2021 17:08:10 GMT
Status: translation complete
Last updated in system: 2021-05-06 12:44:31.944090
Title: Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes
Title (reference translation): Move SLAM: Fully unsupervised deep learning in non-rigid scenes
Authors: Dan Xu, Andrea Vedaldi, Joao F. Henriques
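Abstract summary: The method builds on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different viewpoint. By minimizing the error between the synthetic image and the corresponding real image in a video, a deep network that predicts pose and depth can be trained completely unsupervised.
Reference score (independently computed attention score): 85.56602190773684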
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a method to train deep networks to decompose videos into 3D geometry (camera and depth), moving objects, and their motions, with no supervision. We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different point-of-view, specified by a predicted relative pose and depth map. By minimizing the error between the synthetic image and the corresponding real image in a video, the deep network that predicts pose and depth can be trained completely unsupervised. However, the view synthesis equations rely on a strong assumption: that objects do not move. This rigid-world assumption limits the predictive power, and rules out learning about objects automatically. We propose a simple solution: minimize the error on small regions of the image instead. While the scene as a whole may be non-rigid, it is always possible to find small regions that are approximately rigid, such as inside a moving object. Our network can then predict different poses for each region, in a sliding window. This represents a significantly richer model, including 6D object motions, with little additional complexity. We establish new state-of-the-art results on unsupervised odometry and depth prediction on KITTI. We also demonstrate new capabilities on EPIC-Kitchens, a challenging dataset of indoor videos, where there is no ground truth information for depth, odometry, object segmentation or motion. Yet all are recovered automatically by our method.
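Abstract (reference translation): We propose a method for training deep networks to decompose videos into 3D geometry (camera and depth). We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different viewpoint, using a predicted relative pose and depth map. By minimizing the error between the synthetic image and the corresponding real image in a video, the deep network that predicts pose and depth can be trained completely unsupervised. However, the view synthesis equations rely on a strong assumption: that objects do not move. This rigid-world assumption limits predictive power and rules out learning about objects automatically. We propose a simple solution: minimize the error on small regions of the image. While the scene as a whole may be non-rigid, it is always possible to find small regions that are approximately rigid, such as inside a moving object. The network can then predict a different pose for each region, in a sliding window. This is a much richer model, including 6D object motions, with little additional complexity. We establish new state-of-the-art results on unsupervised odometry and depth prediction on KITTI. EPIC-Kitchens, a dataset of indoor videos, has no ground truth information for depth, odometry, object segmentation or motion. Yet all are recovered automatically by our method.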
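
The view-synthesis idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`reproject`, `sample_nearest`, `window_errors`), the nearest-neighbour sampling, the mean absolute error, and the use of non-overlapping tiles in place of a true sliding window are all assumptions made for illustration. A trainable system would use differentiable bilinear sampling inside a deep-learning framework, with depth and poses predicted by networks.

```python
import numpy as np

def reproject(depth, K, R, t):
    """Map every target pixel to its (x, y) location in the source view.

    depth: (H, W) depth map of the target frame (assumed positive).
    K:     (3, 3) camera intrinsics.
    R, t:  rotation (3, 3) and translation (3,) taking target-camera
           coordinates to source-camera coordinates.
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T        # back-project pixels to unit-depth rays
    points = rays * depth[..., None]       # 3D points in the target camera frame
    moved = points @ R.T + t               # apply the relative pose
    proj = moved @ K.T                     # project with the intrinsics
    return proj[..., :2] / proj[..., 2:3]  # perspective divide

def sample_nearest(image, coords):
    """Nearest-neighbour lookup; an illustrative stand-in for bilinear sampling."""
    H, W = image.shape[:2]
    x = np.clip(np.rint(coords[..., 0]).astype(int), 0, W - 1)
    y = np.clip(np.rint(coords[..., 1]).astype(int), 0, H - 1)
    return image[y, x]

def window_errors(target, source, depth, K, poses, win):
    """Mean absolute photometric error per tile, with one pose per tile.

    poses: dict mapping a tile index (row, col) to its own (R, t).
    win:   tile size in pixels (hypothetical parameter).
    """
    H, W = depth.shape
    errors = {}
    for (i, j), (R, t) in poses.items():
        rows = slice(i * win, min((i + 1) * win, H))
        cols = slice(j * win, min((j + 1) * win, W))
        coords = reproject(depth, K, R, t)[rows, cols]
        synth = sample_nearest(source, coords)  # re-rendered patch
        errors[(i, j)] = float(np.abs(synth - target[rows, cols]).mean())
    return errors
```

Giving each tile its own `(R, t)` is what lets a region inside a moving object be explained by a pose different from the camera's, which is the relaxation of the rigid-world assumption that the abstract describes.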

Related papers