Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching
- URL: http://arxiv.org/abs/2509.09792v2
- Date: Mon, 29 Sep 2025 14:04:01 GMT
- Title: Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching
- Authors: Zimin Xia, Chenghao Xu, Alexandre Alahi
- Abstract summary: We propose an accurate and interpretable fine-grained cross-view localization method. It estimates the 3 Degrees of Freedom (DoF) pose of a ground-level image by matching its local features with a reference aerial image. Experiments show state-of-the-art accuracy in challenging scenarios such as cross-area testing and unknown orientation.
- Score: 80.57282092735991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an accurate and interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom (DoF) pose of a ground-level image by matching its local features with a reference aerial image. Unlike prior approaches that rely on global descriptors or bird's-eye-view (BEV) transformations, our method directly learns ground-aerial image-plane correspondences using weak supervision from camera poses. The matched ground points are lifted into BEV space with monocular depth predictions, and scale-aware Procrustes alignment is then applied to estimate camera rotation, translation, and optionally the scale between relative depth and the aerial metric space. This formulation is lightweight, end-to-end trainable, and requires no pixel-level annotations. Experiments show state-of-the-art accuracy in challenging scenarios such as cross-area testing and unknown orientation. Furthermore, our method offers strong interpretability: correspondence quality directly reflects localization accuracy and enables outlier rejection via RANSAC, while overlaying the re-scaled ground layout on the aerial image provides an intuitive visual cue of localization accuracy.
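The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes the matched ground points have already been lifted into 2D BEV coordinates using predicted depth, and it uses the standard closed-form least-squares similarity fit in 2D (written with complex numbers), not the authors' implementation. The function name `procrustes_2d` and the flag `estimate_scale` are hypothetical, chosen for illustration only.

```python
import cmath
import math


def procrustes_2d(ground_bev, aerial, estimate_scale=True):
    """Least-squares fit of aerial ~ scale * R(theta) * ground + t in 2D.

    Both inputs are equal-length lists of matched (x, y) points.
    Returns (scale, theta_in_radians, (tx, ty)).
    """
    # Treat each 2D point as a complex number so that a rotation plus
    # scaling becomes a single complex multiplication.
    g = [complex(x, y) for x, y in ground_bev]
    a = [complex(x, y) for x, y in aerial]
    n = len(g)
    if n < 2 or n != len(a):
        raise ValueError("need at least two matched point pairs")

    # Center both point sets on their centroids.
    g_mean = sum(g) / n
    a_mean = sum(a) / n
    gc = [z - g_mean for z in g]
    ac = [z - a_mean for z in a]

    # Cross-correlation of the centered sets and energy of the ground set.
    cross = sum(p.conjugate() * q for p, q in zip(gc, ac))
    energy = sum(abs(p) ** 2 for p in gc)
    if energy == 0:
        raise ValueError("ground points are all identical")

    theta = cmath.phase(cross)  # optimal rotation angle
    # Optimal scale; fixed to 1 when the depth is already metric.
    scale = abs(cross) / energy if estimate_scale else 1.0
    t = a_mean - scale * cmath.exp(1j * theta) * g_mean
    return scale, theta, (t.real, t.imag)


# Synthetic check: transform known points, then recover the transform.
ground = [(0.0, 1.0), (2.0, 0.5), (-1.0, 3.0), (4.0, 4.0)]
true_map = 2.0 * cmath.exp(1j * math.radians(30.0))
aerial = []
for x, y in ground:
    z = true_map * complex(x, y) + complex(5.0, -3.0)
    aerial.append((z.real, z.imag))

scale, theta, translation = procrustes_2d(ground, aerial)
print(scale, math.degrees(theta), translation)
```

In this sketch the recovered angle and translation play the role of the camera orientation and position in the aerial frame, and the scale maps relative depth to the aerial metric space. Because the fit needs only two point pairs, it could be run inside a RANSAC loop over sampled correspondences to reject outliers, as the abstract suggests.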
Related papers
- Revisiting Cross-View Localization from Image Matching [12.411420734642988]
Cross-view localization aims to estimate the 3 degrees of freedom pose of a ground-view image by registering it to aerial or satellite imagery. Existing methods either regress poses directly or align features in a shared bird's-eye view (BEV) space. We propose a novel framework that improves both matching and localization.
arXiv Detail & Related papers (2025-08-14T14:57:31Z)
- FG$^2$: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching [69.81167130510333]
We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. Compared to the previous state-of-the-art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set.
arXiv Detail & Related papers (2025-03-24T14:34:20Z)
- A Novel Solution for Drone Photogrammetry with Low-overlap Aerial Images using Monocular Depth Estimation [6.689484367905018]
Low-overlap aerial imagery poses significant challenges to traditional photogrammetric methods. We propose a novel workflow based on monocular depth estimation to address the limitations of conventional techniques.
arXiv Detail & Related papers (2025-03-06T14:59:38Z)
- BevSplat: Resolving Height Ambiguity via Feature-Based Gaussian Primitives for Weakly-Supervised Cross-View Localization [19.10240177449607]
This paper addresses the problem of weakly supervised cross-view localization. The goal is to estimate the pose of a ground camera relative to a satellite image with noisy ground truth annotations. We propose BevSplat, a novel method that resolves height ambiguity by using feature-based Gaussian primitives.
arXiv Detail & Related papers (2025-02-13T08:54:04Z)
- Learning Dense Flow Field for Highly-accurate Cross-view Camera Localization [15.89357790711828]
This paper addresses the problem of estimating the 3-DoF camera pose for a ground-level image with respect to a satellite image.
We propose a novel end-to-end approach that leverages the learning of dense pixel-wise flow fields in pairs of ground and satellite images.
Our approach reduces the median localization error by 89%, 19%, 80% and 35% on the KITTI, Ford multi-AV, VIGOR and Oxford RobotCar datasets.
arXiv Detail & Related papers (2023-09-27T10:26:26Z)
- View Consistent Purification for Accurate Cross-View Localization [59.48131378244399]
This paper proposes a fine-grained self-localization method for outdoor robotics.
The proposed method addresses limitations in existing cross-view localization methods.
It is the first sparse visual-only method that enhances perception in dynamic environments.
arXiv Detail & Related papers (2023-08-16T02:51:52Z)
- Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer [66.82008165644892]
We propose a method to increase the accuracy of a ground camera's location and orientation by estimating the relative rotation and translation between the ground-level image and its matched/retrieved satellite image.
Experimental results demonstrate that our method significantly outperforms the state-of-the-art.
arXiv Detail & Related papers (2023-07-16T11:52:27Z)
- Convolutional Cross-View Pose Estimation [9.599356978682108]
We propose a novel end-to-end method for cross-view pose estimation.
Our method is validated on the VIGOR and KITTI datasets.
On the Oxford RobotCar dataset, our method can reliably estimate the ego-vehicle's pose over time.
arXiv Detail & Related papers (2023-03-09T13:52:28Z)
- Beyond Cross-view Image Retrieval: Highly Accurate Vehicle Localization Using Satellite Image [91.29546868637911]
This paper addresses the problem of vehicle-mounted camera localization by matching a ground-level image with an overhead-view satellite map.
The key idea is to formulate the task as pose estimation and solve it by neural-net based optimization.
Experiments on standard autonomous vehicle localization datasets have confirmed the superiority of the proposed method.
arXiv Detail & Related papers (2022-04-10T19:16:58Z)
- Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth [90.33296913575818]
In video-based scenarios such as video depth estimation and 3D scene reconstruction, the unknown scale and shift in each per-frame prediction may cause depth inconsistency across frames.
We propose a locally weighted linear regression method to recover the scale and shift with very sparse anchor points.
Our method can boost the performance of existing state-of-the-art approaches by up to 50% over several zero-shot benchmarks.
arXiv Detail & Related papers (2022-02-03T08:52:54Z)
- Robust Consistent Video Depth Estimation [65.53308117778361]
We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video.
Our algorithm combines two complementary techniques: (1) flexible deformation-splines for low-frequency large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details.
In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures containing a significant amount of noise, shake, motion blur, and rolling shutter deformations.
arXiv Detail & Related papers (2020-12-10T18:59:48Z)
- Self-Supervised Learning for Monocular Depth Estimation from Aerial Imagery [0.20072624123275526]
We present a method for self-supervised learning for monocular depth estimation from aerial imagery.
For this, we only use an image sequence from a single moving camera and learn to simultaneously estimate depth and pose information.
By sharing the weights between pose and depth estimation, we achieve a relatively small model, which favors real-time application.
arXiv Detail & Related papers (2020-08-17T12:20:46Z)
- Where am I looking at? Joint Location and Orientation Estimation by Cross-View Matching [95.64702426906466]
Cross-view geo-localization is the problem of estimating the position and orientation of a ground-level camera given a large-scale database of geo-tagged aerial images.
Knowing orientation between ground and aerial images can significantly reduce matching ambiguity between these two views.
We design a Dynamic Similarity Matching network to estimate cross-view orientation alignment during localization.
arXiv Detail & Related papers (2020-05-08T05:21:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.