Domain Adaptation of Networks for Camera Pose Estimation: Learning
Camera Pose Estimation Without Pose Labels
- URL: http://arxiv.org/abs/2111.14741v1
- Date: Mon, 29 Nov 2021 17:45:38 GMT
- Title: Domain Adaptation of Networks for Camera Pose Estimation: Learning
Camera Pose Estimation Without Pose Labels
- Authors: Jack Langerman, Ziming Qiu, Gábor Sörös, Dávid Sebők, Yao Wang, Howard Huang
- Abstract summary: One of the key criticisms of deep learning is that large amounts of expensive and difficult-to-acquire training data are required to train models.
DANCE enables the training of models without access to any labels on the target task.
It renders labeled synthetic images from the 3D model and bridges the inevitable domain gap between synthetic and real images.
- Score: 8.409695277909421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the key criticisms of deep learning is that large amounts of expensive
and difficult-to-acquire training data are required in order to train models
with high performance and good generalization capabilities. Focusing on the
task of monocular camera pose estimation via scene coordinate regression (SCR),
we describe a novel method, Domain Adaptation of Networks for Camera pose
Estimation (DANCE), which enables the training of models without access to any
labels on the target task. DANCE requires unlabeled images (without known
poses, ordering, or scene coordinate labels) and a 3D representation of the
space (e.g., a scanned point cloud), both of which can be captured with minimal
effort using off-the-shelf commodity hardware. DANCE renders labeled synthetic
images from the 3D model, and bridges the inevitable domain gap between
synthetic and real images by applying unsupervised image-level domain
adaptation techniques (unpaired image-to-image translation). When tested on
real images, the SCR model trained with DANCE achieved comparable performance
to its fully supervised counterpart (in both cases using PnP-RANSAC for final
pose estimation) at a fraction of the cost. Our code and dataset are available
at https://github.com/JackLangerman/dance
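The final pose solver named in the abstract is standard PnP-RANSAC over the 2D-3D correspondences that the scene coordinate regression (SCR) network predicts. The sketch below is a minimal illustration using OpenCV, not the authors' implementation; the mask argument, RANSAC threshold, and iteration count are assumptions.

```python
import cv2
import numpy as np

def pose_from_scene_coords(scene_coords, K, mask=None):
    """Recover a camera pose from dense SCR predictions via PnP-RANSAC.

    scene_coords: (H, W, 3) predicted 3D scene coordinate per pixel
                  (hypothetical SCR network output).
    K:            (3, 3) camera intrinsic matrix.
    mask:         optional (H, W) boolean mask of confident predictions.
    """
    h, w, _ = scene_coords.shape
    # Pixel locations paired with their predicted 3D scene coordinates.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pts2d = np.stack([u, v], axis=-1).reshape(-1, 2).astype(np.float64)
    pts3d = scene_coords.reshape(-1, 3).astype(np.float64)
    if mask is not None:
        keep = mask.reshape(-1)
        pts2d, pts3d = pts2d[keep], pts3d[keep]

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, distCoeffs=None,
        iterationsCount=1000, reprojectionError=8.0,
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP-RANSAC failed to find a pose")
    R, _ = cv2.Rodrigues(rvec)          # world-to-camera rotation
    return R, tvec.reshape(3), inliers  # pose plus inlier correspondences
```

RANSAC makes the solver robust to the outlier coordinates an SCR network inevitably predicts in ambiguous or poorly textured regions.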
Related papers
- ContraNeRF: 3D-Aware Generative Model via Contrastive Learning with
Unsupervised Implicit Pose Embedding [40.36882490080341]
We propose a novel 3D-aware GAN optimization technique through contrastive learning with implicit pose embeddings.
We make the discriminator estimate a high-dimensional implicit pose embedding from a given image and perform contrastive learning on the pose embedding.
Because it neither looks up nor estimates camera poses, the proposed approach can be employed on datasets where the canonical camera pose is ill-defined (a sketch of the contrastive pose-embedding idea follows this entry).
arXiv Detail & Related papers (2023-04-27T07:53:13Z)
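A minimal sketch of the contrastive idea described above, assuming an InfoNCE-style loss over pose embeddings that the discriminator extracts from two views sharing a camera pose; the pairing scheme, embedding dimension, and temperature are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def pose_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss on implicit pose embeddings.

    emb_a, emb_b: (N, D) pose embeddings; each row pair is assumed to
    share a camera pose (positives), and all other rows serve as
    negatives. Pairing and dimensions are illustrative assumptions.
    """
    za = F.normalize(emb_a, dim=1)
    zb = F.normalize(emb_b, dim=1)
    logits = za @ zb.t() / temperature        # (N, N) cosine similarities
    targets = torch.arange(za.size(0), device=za.device)
    # Diagonal entries are the positive pairs; the rest act as negatives.
    return F.cross_entropy(logits, targets)
```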
- Markerless Camera-to-Robot Pose Estimation via Self-supervised Sim-to-Real Transfer [26.21320177775571]
We propose an end-to-end pose estimation framework that is capable of online camera-to-robot calibration and a self-supervised training method.
Our framework combines deep learning and geometric vision for solving the robot pose, and the pipeline is fully differentiable.
arXiv Detail & Related papers (2023-02-28T05:55:42Z)
- RUST: Latent Neural Scene Representations from Unposed Imagery [21.433079925439234]
Inferring structure of 3D scenes from 2D observations is a fundamental challenge in computer vision.
Recently popularized approaches based on neural scene representations have achieved tremendous impact.
RUST (Really Unposed Scene representation Transformer) is a pose-free approach to novel view synthesis, trained on RGB images alone.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data [69.64723752430244]
We introduce VirtualPose, a two-stage learning framework to exploit the hidden "free lunch" specific to this task.
The first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses.
It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses (virtual-camera projection is sketched after this entry).
arXiv Detail & Related papers (2022-07-20T14:47:28Z)
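A minimal sketch of the kind of virtual-camera synthesis the second stage could be trained on: projecting 3D skeletons through randomly sampled pinhole cameras to produce paired AGR/3D-pose data. Camera placement, field of view, and sampling ranges below are illustrative assumptions, not VirtualPose's configuration.

```python
import numpy as np

def project_virtual_camera(joints3d, fov_deg=60.0, img_size=512, rng=None):
    """Project 3D joints (J, 3), in meters, through a random virtual camera.

    Returns (J, 2) pixel coordinates. All sampling ranges are assumptions.
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Sample a camera orbiting the subject at 3-6 m, looking at the origin.
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    dist = rng.uniform(3.0, 6.0)
    cam_pos = np.array([dist * np.cos(yaw), dist * np.sin(yaw), 1.5])

    # Look-at rotation: the camera z-axis points toward the origin.
    z = -cam_pos / np.linalg.norm(cam_pos)
    x = np.cross([0.0, 0.0, 1.0], z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                  # world-to-camera rotation

    # Pinhole intrinsics derived from the field of view.
    f = img_size / (2.0 * np.tan(np.radians(fov_deg) / 2.0))
    K = np.array([[f, 0.0, img_size / 2.0],
                  [0.0, f, img_size / 2.0],
                  [0.0, 0.0, 1.0]])

    cam = (joints3d - cam_pos) @ R.T         # world -> camera coordinates
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]          # perspective divide -> pixels
```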
- Perspective Flow Aggregation for Data-Limited 6D Object Pose Estimation [121.02948087956955]
For some applications, such as those in space or deep under water, acquiring real images, even unannotated, is virtually impossible.
We propose a method that can be trained solely on synthetic images, or optionally using a few additional real images.
It performs on par with methods that require annotated real images for training when not using any, and outperforms them considerably when using as few as twenty real images.
arXiv Detail & Related papers (2022-03-18T10:20:21Z)
- Self-Supervised 3D Hand Pose Estimation from monocular RGB via Contrastive Learning [50.007445752513625]
We propose a new self-supervised method for the structured regression task of 3D hand pose estimation.
We experimentally investigate the impact of invariant and equivariant contrastive objectives.
We show that a standard ResNet-152, trained on additional unlabeled data, attains a 7.6% improvement in PA-EPE on FreiHAND (the PA-EPE metric is sketched after this entry).
arXiv Detail & Related papers (2021-06-10T17:48:57Z)
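PA-EPE, the metric quoted above, is Procrustes-aligned end-point error: rigidly align the predicted joints to the ground truth with the optimal similarity transform, then average the per-joint Euclidean distances. A minimal NumPy sketch, assuming (J, 3) joint arrays in matching units:

```python
import numpy as np

def pa_epe(pred, gt):
    """Procrustes-aligned end-point error between two (J, 3) joint sets."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g            # center both point sets
    # Optimal rotation via SVD of the cross-covariance (Kabsch/Umeyama).
    U, S, Vt = np.linalg.svd(P.T @ G)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])               # guard against reflections
    R = U @ D @ Vt                           # applied on the right below
    scale = (S * np.diag(D)).sum() / (P ** 2).sum()
    aligned = scale * P @ R + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```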
- Visual Camera Re-Localization Using Graph Neural Networks and Relative Pose Supervision [31.947525258453584]
Visual re-localization means using a single image as input to estimate the camera's location and orientation relative to a pre-recorded environment.
Our proposed method makes few special assumptions, and is fairly lightweight in training and testing.
We validate the effectiveness of our approach on both standard indoor (7-Scenes) and outdoor (Cambridge Landmarks) camera re-localization benchmarks (relative-pose supervision is sketched after this entry).
arXiv Detail & Related papers (2021-04-06T14:29:03Z)
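Relative-pose supervision needs only pairwise labels, and a relative pose follows directly from any two absolute poses. A minimal sketch, assuming the world-to-camera convention x_cam = R @ x_world + t:

```python
import numpy as np

def relative_pose(R1, t1, R2, t2):
    """Pose taking camera-1 coordinates to camera-2 coordinates.

    Assumes world-to-camera convention: x_cam = R @ x_world + t.
    """
    R12 = R2 @ R1.T         # rotation from camera 1 into camera 2
    t12 = t2 - R12 @ t1     # translation between the two frames
    return R12, t12
```

During training, per-image pose predictions can be composed the same way and compared against these pairwise labels, so no absolute ground-truth poses are needed at the supervision step.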
- Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction [118.21363599332493]
We present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video.
Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses.
We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach improves pose estimation accuracy (a photometric-loss sketch follows this entry).
arXiv Detail & Related papers (2020-04-28T12:03:14Z)
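A minimal sketch of a photometric-consistency term of the kind described above: render the current 3D hand-object estimate, compare it to the observed frame, and restrict the error to the rendered silhouette. The differentiable renderer is left abstract; shapes and normalization are assumptions.

```python
import torch

def photometric_loss(rendered, frame, silhouette):
    """Masked L1 photometric consistency.

    rendered:   (3, H, W) image rendered from the current 3D estimate
                (by some differentiable renderer, left abstract here).
    frame:      (3, H, W) observed video frame.
    silhouette: (1, H, W) soft mask of rendered foreground pixels.
    """
    diff = (rendered - frame).abs() * silhouette
    # Normalize by covered area so the loss is independent of mask size.
    return diff.sum() / silhouette.sum().clamp(min=1.0)
```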
- Learning Feature Descriptors using Camera Pose Supervision [101.56783569070221]
We propose a novel weakly-supervised framework that can learn feature descriptors solely from relative camera poses between images.
Because we no longer need pixel-level ground-truth correspondences, our framework opens up the possibility of training on much larger and more diverse datasets for better and unbiased descriptors (an epipolar-supervision sketch follows this entry).
arXiv Detail & Related papers (2020-04-28T06:35:27Z)
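Known camera poses constrain every correct match to an epipolar line, which is what lets descriptors be supervised without pixel-level correspondences. A minimal sketch of the point-to-epipolar-line distance such weak supervision can build on (how the distance is weighted into a descriptor loss is omitted here):

```python
import numpy as np

def epipolar_distance(x1, x2, F):
    """Distance from points x2 to the epipolar lines of points x1.

    x1, x2: (N, 2) putative matches in two images.
    F:      (3, 3) fundamental matrix, computable from the known relative
            camera pose and intrinsics.
    """
    ones = np.ones((x1.shape[0], 1))
    h1 = np.hstack([x1, ones])               # homogeneous coordinates
    h2 = np.hstack([x2, ones])
    lines = h1 @ F.T                         # epipolar lines l = F @ x1
    num = np.abs((h2 * lines).sum(axis=1))   # |l . x2| for each pair
    den = np.linalg.norm(lines[:, :2], axis=1) + 1e-9
    return num / den                         # point-to-line distances
```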
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.