Do We Really Need Scene-specific Pose Encoders?
- URL: http://arxiv.org/abs/2012.12014v1
- Date: Tue, 22 Dec 2020 13:59:52 GMT
- Title: Do We Really Need Scene-specific Pose Encoders?
- Authors: Yoli Shavit and Ron Ferens
- Abstract summary: Visual pose regression models estimate the camera pose from a query image with a single forward pass.
Current models learn a pose encoding from the image using deep convolutional networks that are trained per scene.
We propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual pose regression models estimate the camera pose from a query image
with a single forward pass. Current models learn a pose encoding from the image
using deep convolutional networks that are trained per scene. The resulting
encoding is typically passed to a multi-layer perceptron in order to regress
the pose. In this work, we propose that scene-specific pose encoders are not
required for pose regression and that encodings trained for visual similarity
can be used instead. In order to test our hypothesis, we take a shallow
architecture of several fully connected layers and train it with pre-computed
encodings from a generic image retrieval model. We find that these encodings
are not only sufficient to regress the camera pose, but that, when provided to
a branching fully connected architecture, a trained model can achieve
competitive results and even surpass current state-of-the-art pose
regressors in some cases. Moreover, we show that for outdoor localization, the
proposed architecture is, to date, the only pose regressor to consistently
localize to within 2 meters and 5 degrees.
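The regressor described above (a shallow, branching fully connected network over pre-computed retrieval encodings) can be sketched as follows. This is a minimal illustration with assumed dimensions and random placeholder weights, not the authors' trained model: `ENC_DIM`, `HID_DIM`, and the branch layout are hypothetical choices standing in for whatever the paper's architecture uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Assumed dimensions: a generic image-retrieval encoder yields a
# fixed-length global descriptor per image (size chosen for illustration).
ENC_DIM = 2048   # assumed descriptor size
HID_DIM = 256    # assumed hidden width

# Shared trunk: a shallow stack of fully connected layers over the
# pre-computed encoding (no scene-specific convolutional backbone).
W_shared = rng.standard_normal((ENC_DIM, HID_DIM)) * 0.01
b_shared = np.zeros(HID_DIM)

# Branching head: separate fully connected branches for position (x, y, z)
# and orientation (a quaternion), in the spirit of branched pose regressors.
W_pos = rng.standard_normal((HID_DIM, 3)) * 0.01
b_pos = np.zeros(3)
W_rot = rng.standard_normal((HID_DIM, 4)) * 0.01
b_rot = np.zeros(4)

def regress_pose(encoding):
    """Map a pre-computed retrieval encoding to a 6-DoF camera pose."""
    h = relu(encoding @ W_shared + b_shared)
    position = h @ W_pos + b_pos
    quat = h @ W_rot + b_rot
    quat = quat / np.linalg.norm(quat)  # unit quaternion for orientation
    return position, quat

# Usage with a dummy pre-computed encoding.
enc = rng.standard_normal(ENC_DIM)
p, q = regress_pose(enc)
print(p.shape, q.shape)  # (3,) (4,)
```

Because the encodings are pre-computed once by a generic retrieval model, only this small head needs training per deployment, which is the efficiency argument the abstract makes.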
Related papers
- No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images [100.80376573969045]
NoPoSplat is a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from multi-view images.
Our model achieves real-time 3D Gaussian reconstruction during inference.
This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
arXiv Detail & Related papers (2024-10-31T17:58:22Z) - Map-Relative Pose Regression for Visual Re-Localization [20.89982939633994]
We present a new approach to pose regression, map-relative pose regression (marepo)
We condition the pose regressor on a scene-specific map representation such that its pose predictions are relative to the scene map.
Our approach substantially outperforms previous pose regression methods on two public datasets, one indoor and one outdoor.
arXiv Detail & Related papers (2024-04-15T15:53:23Z) - Coarse-to-Fine Multi-Scene Pose Regression with Transformers [19.927662512903915]
A convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time.
We propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention.
Our method is evaluated on commonly benchmarked indoor and outdoor datasets and shown to surpass both multi-scene and state-of-the-art single-scene absolute pose regressors.
arXiv Detail & Related papers (2023-08-22T20:43:31Z) - Human Pose as Compositional Tokens [88.28348144244131]
We present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency.
It represents a pose by M discrete tokens with each characterizing a sub-structure with several interdependent joints.
A pre-learned decoder network is used to recover the pose from the tokens without further post-processing.
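The token-based representation summarized above (a pose as M discrete tokens, recovered by a pre-learned decoder) can be illustrated with a toy sketch. All sizes (`M`, `V`, `D`, `K`) and the linear decoder are hypothetical stand-ins; the actual PCT decoder is a learned network, not a random matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed sizes: M tokens drawn from a codebook of V entries of dimension D,
# decoded to K joints in 2D (all values chosen for illustration only).
M, V, D, K = 8, 64, 16, 17

codebook = rng.standard_normal((V, D))              # learned token embeddings
W_dec = rng.standard_normal((M * D, K * 2)) * 0.1   # stand-in for the decoder

def decode_pose(token_ids):
    """Recover a K-joint 2D pose from M discrete token indices."""
    emb = codebook[token_ids].reshape(-1)   # (M*D,) concatenated embeddings
    return (emb @ W_dec).reshape(K, 2)      # (K, 2) joint coordinates

pose = decode_pose(rng.integers(0, V, size=M))
print(pose.shape)  # (17, 2)
```

The point of the representation is that each token jointly encodes several interdependent joints, so errors on individual keypoints are constrained by the learned sub-structures.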
arXiv Detail & Related papers (2023-03-21T07:14:18Z) - A Probabilistic Framework for Visual Localization in Ambiguous Scenes [64.13544430239267]
We propose a probabilistic framework that for a given image predicts the arbitrarily shaped posterior distribution of its camera pose.
We do this via a novel formulation of camera pose regression using variational inference, which allows sampling from the predicted distribution.
Our method outperforms existing methods on localization in ambiguous scenes.
arXiv Detail & Related papers (2023-01-05T14:46:54Z) - Camera Pose Auto-Encoders for Improving Pose Regression [6.700873164609009]
We introduce Camera Pose Auto-Encoders (PAEs) to encode camera poses using APRs as their teachers.
We show that the resulting latent pose representations can closely reproduce APR performance and demonstrate their effectiveness for related tasks.
We also show that train images can be reconstructed from the learned pose encoding, paving the way for integrating visual information from the train set at a low memory cost.
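The auto-encoding of camera poses described above can be sketched minimally: a 7-D pose (3-D position plus a unit quaternion) is mapped to a latent code and back. The dimensions and the single-layer encoder/decoder here are assumptions for illustration; the paper's PAEs are trained networks supervised by APR teachers.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed sizes: a 7-D pose (x, y, z + quaternion) and a latent code of
# dimension 32; weights are random placeholders, not trained parameters.
POSE_DIM, LATENT_DIM = 7, 32

W_enc = rng.standard_normal((POSE_DIM, LATENT_DIM)) * 0.1
W_dec = rng.standard_normal((LATENT_DIM, POSE_DIM)) * 0.1

def encode(pose):
    """Map a camera pose to a latent pose representation."""
    return np.tanh(pose @ W_enc)

def decode(z):
    """Reconstruct the pose from its latent code."""
    return z @ W_dec

pose = np.array([1.0, 2.0, 0.5, 1.0, 0.0, 0.0, 0.0])
z = encode(pose)
recon = decode(z)
print(z.shape, recon.shape)  # (32,) (7,)
```

Storing such latent codes instead of full train images is what enables the low-memory integration of train-set information mentioned in the summary.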
arXiv Detail & Related papers (2022-07-12T13:47:36Z) - Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z) - Visual Camera Re-Localization Using Graph Neural Networks and Relative Pose Supervision [31.947525258453584]
Visual re-localization means using a single image as input to estimate the camera's location and orientation relative to a pre-recorded environment.
Our proposed method makes few special assumptions, and is fairly lightweight in training and testing.
We validate the effectiveness of our approach on both standard indoor (7-Scenes) and outdoor (Cambridge Landmarks) camera re-localization benchmarks.
arXiv Detail & Related papers (2021-04-06T14:29:03Z) - Learning Multi-Scene Absolute Pose Regression with Transformers [4.232614032390374]
A convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time.
We propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention.
We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and state-of-the-art single-scene absolute pose regressors.
arXiv Detail & Related papers (2021-03-21T19:21:44Z) - Back to the Feature: Learning Robust Camera Localization from Pixels to Pose [114.89389528198738]
We introduce PixLoc, a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model.
The system can localize in large environments given coarse pose priors but also improve the accuracy of sparse feature matching.
arXiv Detail & Related papers (2021-03-16T17:40:12Z) - 6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference [67.70859730448473]
We present a multimodal camera relocalization framework that captures ambiguities and uncertainties.
We predict multiple camera pose hypotheses as well as the respective uncertainty for each prediction.
We introduce a new dataset specifically designed to foster camera localization research in ambiguous environments.
arXiv Detail & Related papers (2020-04-09T20:55:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.