SACReg: Scene-Agnostic Coordinate Regression for Visual Localization
- URL: http://arxiv.org/abs/2307.11702v3
- Date: Thu, 30 Nov 2023 11:22:53 GMT
- Title: SACReg: Scene-Agnostic Coordinate Regression for Visual Localization
- Authors: Jerome Revaud, Yohann Cabon, Romain Br\'egier, JongMin Lee and
Philippe Weinzaepfel
- Abstract summary: We propose a generalized SCR model trained once in new test scenes, regardless of their scale, without any finetuning.
Instead of encoding the scene coordinates into the network weights, our model takes as input a database image with some sparse 2D pixel to 3D coordinate annotations.
We show that the database representation of images and their 2D-3D annotations can be highly compressed with negligible loss of localization performance.
- Score: 16.866303169903237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene coordinates regression (SCR), i.e., predicting 3D coordinates for every
pixel of a given image, has recently shown promising potential. However,
existing methods remain limited to small scenes memorized during training, and
thus hardly scale to realistic datasets and scenarios. In this paper, we
propose a generalized SCR model trained once to be deployed in new test scenes,
regardless of their scale, without any finetuning. Instead of encoding the
scene coordinates into the network weights, our model takes as input a database
image with some sparse 2D pixel to 3D coordinate annotations, extracted from
e.g. off-the-shelf Structure-from-Motion or RGB-D data, and a query image for
which are predicted a dense 3D coordinate map and its confidence, based on
cross-attention. At test time, we rely on existing off-the-shelf image
retrieval systems and fuse the predictions from a shortlist of relevant
database images w.r.t. the query. Afterwards camera pose is obtained using
standard Perspective-n-Point (PnP). Starting from selfsupervised CroCo
pretrained weights, we train our model on diverse datasets to ensure
generalizabilty across various scenarios, and significantly outperform other
scene regression approaches, including scene-specific models, on multiple
visual localization benchmarks. Finally, we show that the database
representation of images and their 2D-3D annotations can be highly compressed
with negligible loss of localization performance.
Related papers
- No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images [100.80376573969045]
NoPoSplat is a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from multi-view images.
Our model achieves real-time 3D Gaussian reconstruction during inference.
This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
arXiv Detail & Related papers (2024-10-31T17:58:22Z) - GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization [1.4466437171584356]
3D Gaussian Splatting (3DGS) allows for the compact encoding of both 3D geometry and scene appearance with its spatial features.
We propose distilling dense keypoint descriptors into 3DGS to improve the model's spatial understanding.
Our approach surpasses state-of-the-art Neural Render Pose (NRP) methods, including NeRFMatch and PNeRFLoc.
arXiv Detail & Related papers (2024-09-24T23:18:32Z) - GLACE: Global Local Accelerated Coordinate Encoding [66.87005863868181]
Scene coordinate regression methods are effective in small-scale scenes but face significant challenges in large-scale scenes.
We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network.
Our method achieves state-of-the-art results on large-scale scenes with a low-map-size model.
arXiv Detail & Related papers (2024-06-06T17:59:50Z) - HSCNet++: Hierarchical Scene Coordinate Classification and Regression
for Visual Localization with Transformer [23.920690073252636]
We present a new hierarchical scene coordinate network to predict pixel scene coordinates in a coarse-to-fine manner from a single RGB image.
The proposed method, which is an extension of HSCNet, allows us to train compact models which scale robustly to large environments.
arXiv Detail & Related papers (2023-05-05T15:00:14Z) - SGAligner : 3D Scene Alignment with Scene Graphs [84.01002998166145]
Building 3D scene graphs has emerged as a topic in scene representation for several embodied AI applications.
We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial.
We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios.
arXiv Detail & Related papers (2023-04-28T14:39:22Z) - Visual Localization using Imperfect 3D Models from the Internet [54.731309449883284]
This paper studies how imperfections in 3D models affect localization accuracy.
We show that 3D models from the Internet show promise as an easy-to-obtain scene representation.
arXiv Detail & Related papers (2023-04-12T16:15:05Z) - Soft Expectation and Deep Maximization for Image Feature Detection [68.8204255655161]
We propose SEDM, an iterative semi-supervised learning process that flips the question and first looks for repeatable 3D points, then trains a detector to localize them in image space.
Our results show that this new model trained using SEDM is able to better localize the underlying 3D points in a scene.
arXiv Detail & Related papers (2021-04-21T00:35:32Z) - Learning Camera Localization via Dense Scene Matching [45.0957383562443]
Camera localization aims to estimate 6 DoF camera poses from RGB images.
Recent learning-based approaches encode structures into a specific convolutional neural network (CNN)
We present a new method for camera localization using dense matching (DSM)
arXiv Detail & Related papers (2021-03-31T03:47:42Z) - Back to the Feature: Learning Robust Camera Localization from Pixels to
Pose [114.89389528198738]
We introduce PixLoc, a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model.
The system can localize in large environments given coarse pose priors but also improve the accuracy of sparse feature matching.
arXiv Detail & Related papers (2021-03-16T17:40:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.