3D Hierarchical Refinement and Augmentation for Unsupervised Learning of
Depth and Pose from Monocular Video
- URL: http://arxiv.org/abs/2112.03045v1
- Date: Mon, 6 Dec 2021 13:46:48 GMT
- Title: 3D Hierarchical Refinement and Augmentation for Unsupervised Learning of
Depth and Pose from Monocular Video
- Authors: Guangming Wang, Jiquan Zhong, Shijie Zhao, Wenhua Wu, Zhe Liu, Hesheng
Wang
- Abstract summary: A novel unsupervised training framework is proposed with 3D hierarchical refinement and augmentation using explicit 3D geometry.
In this framework, the depth and pose estimations are hierarchically and mutually coupled to refine the estimated pose layer by layer.
Our visual odometry outperforms all recent unsupervised monocular learning-based methods and is competitive with the geometry-based method.
- Score: 16.613015664195224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Depth and ego-motion estimation are essential for the localization and
navigation of autonomous robots and for autonomous driving. Recent studies make it
possible to learn per-pixel depth and ego-motion from unlabeled monocular video.
A novel unsupervised training framework is proposed with 3D hierarchical
refinement and augmentation using explicit 3D geometry. In this framework, the
depth and pose estimations are hierarchically and mutually coupled, refining the
estimated pose layer by layer. An intermediate-view image is synthesized by
warping the pixels of one image with the estimated depth and a coarse pose. The
residual pose transformation is then estimated from this synthesized view and the
adjacent frame, refining the coarse pose. The iterative refinement is implemented
in a differentiable manner, so the whole framework can be optimized jointly.
Meanwhile, a new image augmentation method is proposed for pose estimation:
synthesizing a new-view image augments the pose in 3D space while yielding a new
augmented 2D image. Experiments on KITTI demonstrate that our depth estimation
achieves state-of-the-art performance and even surpasses recent approaches that
utilize other auxiliary tasks. Our visual odometry outperforms all recent
unsupervised monocular learning-based methods and is competitive with the
geometry-based method ORB-SLAM2 with back-end optimization.
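For concreteness, the warping-and-refinement loop the abstract describes can be sketched as follows. This is a minimal PyTorch sketch under assumptions, not the authors' implementation: `pose_net` is a hypothetical network that returns a 4x4 residual transform from a concatenated image pair, and the left-composition order `T_res @ T` is an assumption.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift each pixel to a 3D point using the predicted depth.
    depth: (B, 1, H, W); K_inv: (B, 3, 3) inverse camera intrinsics."""
    B, _, H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float()  # (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, -1, -1).to(depth.device)   # (B, 3, HW)
    return (K_inv @ pix) * depth.view(B, 1, -1)                   # (B, 3, HW)

def warp(src_img, depth_tgt, T, K, K_inv):
    """Synthesize the target view from the source image, given target depth
    and a relative pose T (B, 4, 4): the paper's intermediate-view image."""
    B, _, H, W = src_img.shape
    pts = backproject(depth_tgt, K_inv)                 # 3D points in target frame
    pts = T[:, :3, :3] @ pts + T[:, :3, 3:4]            # apply rigid motion
    pix = K @ pts                                       # project into source image
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)      # perspective divide
    gx = 2.0 * pix[:, 0] / (W - 1) - 1.0                # normalize to [-1, 1]
    gy = 2.0 * pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)

def refine_pose(tgt, src, depth_tgt, T_coarse, pose_net, K, K_inv, n_iters=2):
    """Hierarchical refinement: warp with the current pose, predict a residual
    pose from (synthesized view, target), and compose; fully differentiable."""
    T = T_coarse
    for _ in range(n_iters):
        synth = warp(src, depth_tgt, T, K, K_inv)       # intermediate-view image
        T_res = pose_net(torch.cat([synth, tgt], 1))    # residual transform (B, 4, 4)
        T = T_res @ T                                   # compose the refinement
    return T
```

Because `warp()` and the pose composition are built entirely from differentiable tensor operations, gradients flow through every refinement iteration, which is what allows the whole framework to be optimized jointly as the abstract states.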
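The 3D augmentation can reuse the same warping: sample a small random rigid perturbation, synthesize a new 2D view from it, and feed the augmented image to the same photometric objective. A hedged sketch follows; the perturbation magnitudes and the Rodrigues construction are illustrative, not values taken from the paper.

```python
import torch

def random_se3(batch, max_angle=0.05, max_trans=0.05):
    """Sample small random SE(3) perturbations as (batch, 4, 4) matrices.
    Magnitudes are illustrative assumptions."""
    aa = (torch.rand(batch, 3) * 2 - 1) * max_angle     # axis-angle rotation
    t = (torch.rand(batch, 3) * 2 - 1) * max_trans      # translation
    theta = aa.norm(dim=1, keepdim=True).clamp(min=1e-8)
    k = aa / theta                                      # unit rotation axis
    S = torch.zeros(batch, 3, 3)                        # skew-symmetric [k]x
    S[:, 0, 1], S[:, 0, 2] = -k[:, 2], k[:, 1]
    S[:, 1, 0], S[:, 1, 2] = k[:, 2], -k[:, 0]
    S[:, 2, 0], S[:, 2, 1] = -k[:, 1], k[:, 0]
    # Rodrigues' formula: R = I + sin(theta) S + (1 - cos(theta)) S^2
    R = (torch.eye(3) + torch.sin(theta)[..., None] * S
         + (1 - torch.cos(theta))[..., None] * (S @ S))
    T = torch.eye(4).repeat(batch, 1, 1)
    T[:, :3, :3], T[:, :3, 3] = R, t
    return T

# Usage (reusing warp() from the sketch above, with img_t, depth_t, K, K_inv
# as defined there):
#   dT = random_se3(img_t.shape[0])
#   img_aug = warp(img_t, depth_t, dT, K, K_inv)
# The pose is thus augmented in 3D space while yielding a new augmented 2D image.
```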
Related papers
- Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation [41.98740330990215]
This work proposes a novel approach that bridges 2D vision foundation models with 3D tasks.
We leverage the zero-shot capabilities of vision-language models for image semantics.
We project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision.
arXiv Detail & Related papers (2025-03-10T09:54:40Z)
- FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views [93.6881532277553]
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images.
Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes.
arXiv Detail & Related papers (2025-02-17T18:54:05Z)
- GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.
Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the lightest image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
- Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting [75.7154104065613]
We introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process.
We also introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry.
arXiv Detail & Related papers (2024-04-30T17:59:40Z)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
- High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views.
Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
- Learning Ego 3D Representation as Ray Tracing [42.400505280851114]
We present a novel end-to-end architecture for ego 3D representation learning from unconstrained camera views.
Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation.
We show that our model outperforms all state-of-the-art alternatives significantly.
arXiv Detail & Related papers (2022-06-08T17:55:50Z)
- GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator [51.89441403642665]
6D pose estimation of rigid objects is a long-standing and challenging task in computer vision.
Recently, the emergence of deep learning has revealed the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses.
This paper introduces a fully learning-based object pose estimator.
arXiv Detail & Related papers (2021-02-24T09:11:31Z)
- Residual Pose: A Decoupled Approach for Depth-based 3D Human Pose Estimation [18.103595280706593]
We leverage recent advances in reliable 2D pose estimation with CNNs to estimate the 3D pose of people from depth images.
Our approach achieves very competitive results both in accuracy and speed on two public datasets.
arXiv Detail & Related papers (2020-11-10T10:08:13Z)
- Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency [40.56510679634943]
We propose a self-supervised training architecture by leveraging the multi-view geometry consistency.
We design three novel loss functions for multi-view consistency, including the pixel consistency loss, the depth consistency loss, and the facial landmark-based epipolar loss.
Our method is accurate and robust, especially under large variations of expressions, poses, and illumination conditions.
arXiv Detail & Related papers (2020-07-24T12:36:09Z)
- Towards Realistic 3D Embedding via View Alignment [53.89445873577063]
This paper presents an innovative View Alignment GAN (VA-GAN) that composes new images by embedding 3D models into 2D background images realistically and automatically.
VA-GAN consists of a texture generator and a differential discriminator that are inter-connected and end-to-end trainable.
arXiv Detail & Related papers (2020-07-14T14:45:00Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)