Attentional Separation-and-Aggregation Network for Self-supervised
Depth-Pose Learning in Dynamic Scenes
- URL: http://arxiv.org/abs/2011.09369v1
- Date: Wed, 18 Nov 2020 16:07:30 GMT
- Title: Attentional Separation-and-Aggregation Network for Self-supervised
Depth-Pose Learning in Dynamic Scenes
- Authors: Feng Gao, Jincheng Yu, Hao Shen, Yu Wang, Huazhong Yang
- Abstract summary: Learning depth and ego-motion from unlabeled videos via self-supervision from epipolar projection can improve the robustness and accuracy of the 3D perception and localization of vision-based robots.
However, the rigid projection computed by ego-motion cannot represent all scene points, such as points on moving objects, leading to false guidance in these regions.
We propose an Attentional Separation-and-Aggregation Network (ASANet), which can learn to distinguish and extract the scene's static and dynamic characteristics via the attention mechanism.
- Score: 19.704284616226552
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning depth and ego-motion from unlabeled videos via self-supervision from
epipolar projection can improve the robustness and accuracy of the 3D
perception and localization of vision-based robots. However, the rigid
projection computed by ego-motion cannot represent all scene points, such as
points on moving objects, leading to false guidance in these regions. To
address this problem, we propose an Attentional Separation-and-Aggregation
Network (ASANet), which can learn to distinguish and extract the scene's static
and dynamic characteristics via the attention mechanism. We further propose a
novel MotionNet with an ASANet as the encoder, followed by two separate
decoders, to estimate the camera's ego-motion and the scene's dynamic motion
field. Then, we introduce an auto-selecting approach to automatically detect
moving objects for dynamic-aware learning. Empirical experiments demonstrate
that our method achieves state-of-the-art performance on the KITTI benchmark.
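As a concrete illustration of the setup described above, the following is a minimal PyTorch-style sketch, not the authors' released code: the camera model, layer sizes, and names (`rigid_warp`, `MotionNet`) are illustrative assumptions. It shows the rigid (ego-motion) reprojection that drives the photometric self-supervision, and a MotionNet-like module with a shared encoder (a placeholder standing in for ASANet's attention-based separation) feeding two decoders, one for 6-DoF ego-motion and one for a dense motion field of moving objects.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rigid_warp(img_src, depth_tgt, T, K, K_inv):
    """Reconstruct the target view from a source view using the target depth
    and a single rigid (ego-motion) transform T of shape (B, 3, 4).
    K, K_inv: camera intrinsics and their inverse, shape (B, 3, 3)."""
    b, _, h, w = depth_tgt.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=depth_tgt.dtype, device=depth_tgt.device),
        torch.arange(w, dtype=depth_tgt.dtype, device=depth_tgt.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1)
    cam = (K_inv @ pix) * depth_tgt.view(b, 1, -1)      # back-project to 3D
    cam = T[:, :, :3] @ cam + T[:, :, 3:]               # apply ego-motion
    # NOTE: points on independently moving objects violate this rigid model;
    # the paper adds a dense motion field for exactly those regions.
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)      # perspective division
    grid = torch.stack(
        [2 * uv[:, 0] / (w - 1) - 1, 2 * uv[:, 1] / (h - 1) - 1], dim=-1
    ).view(b, h, w, 2)
    return F.grid_sample(img_src, grid, padding_mode="border", align_corners=True)


class MotionNet(nn.Module):
    """Shared encoder (a stand-in for the paper's ASANet, which separates static
    and dynamic features with attention) feeding two decoders: a 6-DoF
    ego-motion head and a dense per-pixel motion-field head."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                   # placeholder for ASANet
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.pose_head = nn.Sequential(                 # camera ego-motion
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, 6)
        )
        self.motion_head = nn.Conv2d(feat_dim, 3, 3, padding=1)  # object motion

    def forward(self, frame_pair):                      # (B, 6, H, W): two frames
        feat = self.encoder(frame_pair)
        pose = self.pose_head(feat)                     # (B, 6) axis-angle + t
        motion = F.interpolate(self.motion_head(feat), scale_factor=4,
                               mode="bilinear", align_corners=False)
        return pose, motion                             # motion: (B, 3, H, W)
```

A photometric loss would then compare the warped source frame against the target frame; converting the 6-DoF pose vector into a transform and auto-selecting the moving-object regions follow the paper's own procedure, which is not reproduced here.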
Related papers
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
- Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.
Its object-centric voxelization infers per-object occupancy probabilities at individual spatial locations.
Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z)
- Optical Flow boosts Unsupervised Localization and Segmentation [22.625511865323183]
We propose a new loss term formulation that uses optical flow in unlabeled videos to encourage self-supervised ViT features to become closer to each other (see the sketch after this list).
We use the proposed loss function to finetune vision transformers that were originally trained on static images.
arXiv Detail & Related papers (2023-07-25T16:45:35Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers to obtain an enhanced depth feature representation (see the cross-attention sketch after this list).
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation.
We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
- Learning What and Where -- Unsupervised Disentangling Location and Identity Tracking [0.44040106718326594]
We introduce an unsupervised LOCation and Identity tracking system (Loci).
Inspired by the dorsal-ventral pathways in the brain, Loci tackles the what-and-where binding problem by means of a self-supervised segregation mechanism.
Loci may set the stage for deeper, explanation-oriented video processing.
arXiv Detail & Related papers (2022-05-26T13:30:14Z)
- Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation [76.58256020932312]
Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task.
We present a self-supervised learning framework for 3D object motion field estimation from monocular videos.
arXiv Detail & Related papers (2021-10-13T16:45:01Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Editable Free-viewpoint Video Using a Layered Neural Representation [35.44420164057911]
We propose the first approach for editable free-viewpoint video generation for large-scale dynamic scenes using only 16 sparse cameras.
The core of our approach is a new layered neural representation, where each dynamic entity, including the environment itself, is formulated into a space-time coherent neural layered radiance representation called ST-NeRF.
Experiments demonstrate the effectiveness of our approach to achieve high-quality, photo-realistic, and editable free-viewpoint video generation for dynamic scenes.
arXiv Detail & Related papers (2021-04-30T06:50:45Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations (see the motion-composition sketch after this list).
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
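For the "Optical Flow boosts Unsupervised Localization and Segmentation" entry above, the exact loss formulation is not given in the excerpt; the sketch below is an assumed variant that pulls self-supervised ViT patch features at flow-corresponded locations closer together. The function name and the cosine-distance choice are illustrative.

```python
import torch
import torch.nn.functional as F


def flow_consistency_loss(feat_t, feat_t1, flow):
    """feat_t, feat_t1: (B, C, H, W) patch features of frames t and t+1.
    flow: (B, 2, H, W) optical flow from t to t+1, in feature-grid pixels."""
    b, c, h, w = feat_t.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing="ij",
    )
    x_new = xs.unsqueeze(0) + flow[:, 0]                 # (B, H, W)
    y_new = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack(
        [2 * x_new / (w - 1) - 1, 2 * y_new / (h - 1) - 1], dim=-1
    )                                                    # normalized sample grid
    feat_t1_warped = F.grid_sample(feat_t1, grid, align_corners=True)
    # cosine distance between a feature and its flow-corresponded counterpart
    cos = F.cosine_similarity(feat_t, feat_t1_warped, dim=1)
    return (1.0 - cos).mean()
```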
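For the Dyna-DepthFormer entry, here is a minimal sketch of multi-view cross-attention, assuming reference- and source-view feature maps of equal size; the single-layer design and layer sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ref_feat, src_feat):
        """ref_feat, src_feat: (B, C, H, W) reference/source view features."""
        b, c, h, w = ref_feat.shape
        q = ref_feat.flatten(2).transpose(1, 2)          # (B, HW, C) queries
        kv = src_feat.flatten(2).transpose(1, 2)         # (B, HW, C) keys/values
        out, _ = self.attn(q, kv, kv)                    # cross-view attention
        out = self.norm(q + out)                         # residual + norm
        return out.transpose(1, 2).reshape(b, c, h, w)
```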
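For the "Learning to Segment Rigid Motions from Two Frames" entry, here is a hedged sketch of one way predicted masks and per-object rigid transforms can be composed into a dense 3D motion field; the function name and the soft mask-weighted mixing are assumptions, not the paper's exact parameterization.

```python
import torch


def compose_rigid_flow(points_cam, masks, transforms):
    """points_cam: (B, 3, H, W) back-projected 3D points of frame t.
    masks: (B, K, H, W) soft masks for background + K-1 rigid objects (sum to 1).
    transforms: (B, K, 3, 4) one rigid transform per mask (background = ego-motion).
    Returns the per-pixel 3D scene flow implied by the K rigid motions."""
    b, _, h, w = points_cam.shape
    pts = points_cam.view(b, 1, 3, -1)                   # (B, 1, 3, HW)
    R = transforms[..., :3]                              # (B, K, 3, 3)
    t = transforms[..., 3:]                              # (B, K, 3, 1)
    moved = R @ pts + t                                  # (B, K, 3, HW)
    flow_k = moved - pts                                 # per-object scene flow
    w_k = masks.view(b, -1, 1, h * w)                    # (B, K, 1, HW)
    flow = (w_k * flow_k).sum(dim=1).view(b, 3, h, w)    # mask-weighted mixture
    return flow
```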
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.