SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation
- URL: http://arxiv.org/abs/2211.14651v3
- Date: Tue, 28 Mar 2023 19:16:50 GMT
- Title: SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation
- Authors: Ted Lentsch, Zimin Xia, Holger Caesar, Julian F. P. Kooij
- Abstract summary: SliceMatch consists of ground and aerial feature extractors, feature aggregators, and a pose predictor.
- Score: 7.751856268560216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work addresses cross-view camera pose estimation, i.e., determining the
3-Degrees-of-Freedom camera pose of a given ground-level image w.r.t. an aerial
image of the local area. We propose SliceMatch, which consists of ground and
aerial feature extractors, feature aggregators, and a pose predictor. The
feature extractors extract dense features from the ground and aerial images.
Given a set of candidate camera poses, the feature aggregators construct a
single ground descriptor and a set of pose-dependent aerial descriptors.
Notably, our novel aerial feature aggregator has a cross-view attention module
for ground-view guided aerial feature selection and utilizes the geometric
projection of the ground camera's viewing frustum on the aerial image to pool
features. The efficient construction of aerial descriptors is achieved using
precomputed masks. SliceMatch is trained using contrastive learning and pose
estimation is formulated as a similarity comparison between the ground
descriptor and the aerial descriptors. Compared to the state-of-the-art,
SliceMatch achieves a 19% lower median localization error on the VIGOR
benchmark using the same VGG16 backbone at 150 frames per second, and a 50%
lower error when using a ResNet50 backbone.
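The abstract's final step, picking the pose whose aerial descriptor best matches the single ground descriptor, can be sketched as below. This is an illustrative reconstruction, not the paper's implementation: the descriptor dimensionality, the candidate pose grid, and the function name `select_pose` are all assumptions, and cosine similarity stands in for whatever similarity the trained model uses.

```python
import numpy as np

# Hypothetical sketch of SliceMatch-style pose selection: the candidate
# pose whose (pose-dependent) aerial descriptor is most similar to the
# ground descriptor is returned. Descriptors are assumed L2-normalized.
def select_pose(ground_desc, aerial_descs, candidate_poses):
    """ground_desc: (C,); aerial_descs: (N, C), one per candidate pose."""
    sims = aerial_descs @ ground_desc          # (N,) cosine similarities
    best = int(np.argmax(sims))
    return candidate_poses[best], float(sims[best])

# Toy example: 8 candidate (x, y, yaw) poses with random 64-D descriptors.
rng = np.random.default_rng(0)
poses = [(x, y, yaw) for x in range(2) for y in range(2) for yaw in (0.0, 90.0)]
g = rng.normal(size=64)
g /= np.linalg.norm(g)
A = rng.normal(size=(len(poses), 64))
A /= np.linalg.norm(A, axis=1, keepdims=True)
pose, score = select_pose(g, A, poses)
print(pose, round(score, 3))
```

In the actual method, the aerial descriptors are built efficiently with precomputed frustum-projection masks, so only the scoring step above runs per candidate at inference time.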
Related papers
- Cross-View Open-Vocabulary Object Detection in Aerial Imagery [48.851422992413184]
We propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery.
The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings.
Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting.
arXiv Detail & Related papers (2025-10-04T16:12:03Z) - Aerial-Ground Image Feature Matching via 3D Gaussian Splatting-based Intermediate View Rendering [7.454339483033969]
The integration of aerial and ground images has been a promising solution in 3D modeling of complex scenes.
The primary contribution of this study is a feature matching algorithm for aerial and ground images.
arXiv Detail & Related papers (2025-09-24T08:50:13Z) - Loc$^2$: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching [80.57282092735991]
We propose an accurate and interpretable fine-grained cross-view localization method.
It estimates the 3 Degrees of Freedom (DoF) pose of a ground-level image by matching its local features with a reference aerial image.
Experiments show state-of-the-art accuracy in challenging scenarios such as cross-area testing and unknown orientation.
arXiv Detail & Related papers (2025-09-11T18:52:16Z) - FG$^2$: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching [69.81167130510333]
We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings.
The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image.
Compared to the previous state-of-the-art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set.
arXiv Detail & Related papers (2025-03-24T14:34:20Z) - MaFreeI2P: A Matching-Free Image-to-Point Cloud Registration Paradigm with Active Camera Pose Retrieval [2.400446821380503]
Image-to-point cloud registration seeks to estimate the relative camera pose between an image and a point cloud.
Recent matching-based methods tend to tackle this by building 2D-3D correspondences.
We propose a matching-free paradigm, named MaFreeI2P.
arXiv Detail & Related papers (2024-08-05T11:39:22Z) - SCENES: Subpixel Correspondence Estimation With Epipolar Supervision [18.648772607057175]
Extracting point correspondences from two or more views of a scene is a fundamental computer vision problem.
Existing local feature matching approaches, trained with correspondence supervision on large-scale datasets, obtain highly-accurate matches on the test sets; such supervision, however, typically presupposes known 3D structure.
We relax this assumption by removing the requirement of 3D structure, e.g., depth maps or point clouds, and only require camera pose information, which can be obtained from odometry.
arXiv Detail & Related papers (2024-01-19T18:57:46Z) - AG-ReID.v2: Bridging Aerial and Ground Views for Person Re-identification [39.58286453178339]
Aerial-ground person re-identification (Re-ID) presents unique challenges in computer vision.
We introduce AG-ReID.v2, a dataset specifically designed for person Re-ID in mixed aerial and ground scenarios.
This dataset comprises 100,502 images of 1,615 unique individuals, each annotated with matching IDs and 15 soft attribute labels.
arXiv Detail & Related papers (2024-01-05T04:53:33Z) - Learning Dense Flow Field for Highly-accurate Cross-view Camera Localization [15.89357790711828]
This paper addresses the problem of estimating the 3-DoF camera pose for a ground-level image with respect to a satellite image.
We propose a novel end-to-end approach that leverages the learning of dense pixel-wise flow fields in pairs of ground and satellite images.
Our approach reduces the median localization error by 89%, 19%, 80% and 35% on the KITTI, Ford multi-AV, VIGOR and Oxford RobotCar datasets.
arXiv Detail & Related papers (2023-09-27T10:26:26Z) - PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching [51.142988196855484]
We propose PoseMatcher, an accurate, model-free, one-shot object pose estimator.
We create a new training pipeline for object to image matching based on a three-view system.
To enable PoseMatcher to attend to distinct input modalities, an image and a pointcloud, we introduce IO-Layer.
arXiv Detail & Related papers (2023-04-03T21:14:59Z) - Occupancy Planes for Single-view RGB-D Human Reconstruction [120.5818162569105]
Single-view RGB-D human reconstruction with implicit functions is often formulated as per-point classification.
We propose the occupancy planes (OPlanes) representation, which enables formulating single-view RGB-D human reconstruction as occupancy prediction on planes that slice through the camera's view frustum.
arXiv Detail & Related papers (2022-08-04T17:59:56Z) - Beyond Cross-view Image Retrieval: Highly Accurate Vehicle Localization Using Satellite Image [91.29546868637911]
This paper addresses the problem of vehicle-mounted camera localization by matching a ground-level image with an overhead-view satellite map.
The key idea is to formulate the task as pose estimation and solve it by neural-net based optimization.
Experiments on standard autonomous vehicle localization datasets have confirmed the superiority of the proposed method.
arXiv Detail & Related papers (2022-04-10T19:16:58Z) - Compositional Sketch Search [91.84489055347585]
We present an algorithm for searching image collections using free-hand sketches.
We exploit drawings as a concise and intuitive representation for specifying entire scene compositions.
arXiv Detail & Related papers (2021-06-15T09:38:09Z) - DeepI2P: Image-to-Point Cloud Registration via Deep Classification [71.3121124994105]
DeepI2P is a novel approach for cross-modality registration between an image and a point cloud.
Our method estimates the relative rigid transformation between the coordinate frames of the camera and Lidar.
We circumvent the difficulty by converting the registration problem into a classification and inverse camera projection optimization problem.
arXiv Detail & Related papers (2021-04-08T04:27:32Z) - Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [100.93808824091258]
We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras.
Our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a bird's-eye-view grid.
We show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network.
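The "splat" step described above, pooling per-camera frustum features into a shared bird's-eye-view grid, can be illustrated with a minimal sketch. The grid size, cell resolution, sum-pooling choice, and the name `splat_to_bev` are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

# Illustrative "splat": scatter frustum point features into BEV grid
# cells by their ground-plane (x, y) position, sum-pooling collisions.
def splat_to_bev(points_xy, feats, grid_size=4, cell=1.0):
    """points_xy: (N, 2) ground-plane coords; feats: (N, C) features.
    Returns a (grid_size, grid_size, C) BEV feature map."""
    bev = np.zeros((grid_size, grid_size, feats.shape[1]))
    ix = np.clip((points_xy[:, 0] / cell).astype(int), 0, grid_size - 1)
    iy = np.clip((points_xy[:, 1] / cell).astype(int), 0, grid_size - 1)
    np.add.at(bev, (ix, iy), feats)  # unbuffered scatter-add per cell
    return bev

# Three frustum points with 2-D features; two land in cell (0, 0).
pts = np.array([[0.2, 0.3], [0.4, 0.1], [3.5, 3.9]])
f = np.ones((3, 2))
bev = splat_to_bev(pts, f)
print(bev[0, 0])  # → [2. 2.], the two co-located features sum-pooled
```

`np.add.at` is used instead of `bev[ix, iy] += feats` because plain fancy-index assignment would silently drop repeated indices rather than accumulate them.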
arXiv Detail & Related papers (2020-08-13T06:29:01Z) - Evaluation of Cross-View Matching to Improve Ground Vehicle Localization with Aerial Perception [17.349420462716886]
Cross-view matching refers to the problem of finding the closest match for a given query ground view image to one from a database of aerial images.
In this paper, we evaluate cross-view matching for the task of localizing a ground vehicle over a longer trajectory.
arXiv Detail & Related papers (2020-03-13T23:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.