From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance
- URL: http://arxiv.org/abs/2511.09820v1
- Date: Fri, 14 Nov 2025 01:11:19 GMT
- Title: From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance
- Authors: Jeongho Min, Dongyoung Kim, Jaehyup Lee
- Abstract summary: Cross-view image retrieval is critical for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. We present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM). Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings.
- Score: 10.533095161205358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.
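The final retrieval step described in the abstract (matching encoder features after PCA-based whitening) can be sketched as follows. This is a minimal illustration with synthetic embeddings standing in for real DINOv2 features; the function names, dimensions, and data here are assumptions for demonstration, not the authors' released code.

```python
import numpy as np

def fit_pca_whitening(gallery_feats, n_components=64, eps=1e-6):
    """Fit a PCA-whitening transform on a gallery of feature vectors.

    Returns a function that projects features onto the top principal
    components, rescales each axis to unit variance, and L2-normalizes,
    so that a simple dot product gives cosine similarity.
    """
    mean = gallery_feats.mean(axis=0)
    centered = gallery_feats - mean
    # SVD of the centered matrix: covariance eigenvalues are s^2 / (n - 1).
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                      # (k, d)
    scales = s[:n_components] / np.sqrt(len(gallery_feats) - 1)

    def transform(x):
        proj = (x - mean) @ components.T / (scales + eps)
        return proj / np.linalg.norm(proj, axis=-1, keepdims=True)

    return transform

# Synthetic stand-ins: 100 satellite-tile embeddings and one street-view
# query embedding that is a slightly perturbed copy of tile 42.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 768))
query = gallery[42] + 0.01 * rng.normal(size=768)

transform = fit_pca_whitening(gallery, n_components=64)
g = transform(gallery)            # (100, 64) whitened gallery
q = transform(query[None])        # (1, 64) whitened query
scores = g @ q.T                  # cosine similarities after whitening
best = int(np.argmax(scores))
```

Whitening downweights the dominant directions shared by all gallery features, which tends to make nearest-neighbor retrieval more discriminative than raw cosine similarity on the encoder outputs.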
Related papers
- Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization [17.908597896653045]
This paper presents a cross-view UAV localization framework that performs map matching via object detection. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem. Our method achieves strong retrieval and localization performance using a fine-grained, graph-based node-similarity metric.
arXiv Detail & Related papers (2025-11-04T11:25:31Z) - GLEAM: Learning to Match and Explain in Cross-View Geo-Localization [66.11208984986813]
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. We present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities, including UAV imagery, street maps, panoramic views, and ground photographs, by aligning them exclusively with satellite imagery. To address the lack of interpretability in traditional CVGL methods, we propose GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning.
arXiv Detail & Related papers (2025-09-09T07:14:31Z) - AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models [61.350774745321566]
Large vision-language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level. However, they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images.
arXiv Detail & Related papers (2025-08-14T14:06:28Z) - GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization [70.65458151146767]
Cross-view localization is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. We propose GeoDistill, a framework that uses teacher-student learning with Field-of-View (FoV)-based masking.
arXiv Detail & Related papers (2025-07-15T03:00:15Z) - Pole-based Vehicle Localization with Vector Maps: A Camera-LiDAR Comparative Study [6.300346102366891]
In road environments, much common street furniture, such as traffic signs, traffic lights, and street lights, takes the form of poles. This paper introduces a real-time method for camera-based pole detection using a lightweight neural network trained on automatically annotated images. The results highlight the high accuracy of the vision-based approach in open-road conditions.
arXiv Detail & Related papers (2024-12-11T09:05:05Z) - OSMLoc: Single Image-Based Visual Localization in OpenStreetMap with Fused Geometric and Semantic Guidance [20.043977909592115]
OSMLoc is a brain-inspired visual localization approach that matches first-person-view images against OpenStreetMap maps. It integrates semantic and geometric guidance to significantly improve accuracy, robustness, and generalization capability.
arXiv Detail & Related papers (2024-11-13T14:59:00Z) - Game4Loc: A UAV Geo-Localization Benchmark from Game Data [0.0]
We introduce a more practical UAV geo-localization task that includes partial matches of cross-view paired data. Experiments demonstrate the effectiveness of our data and training method for UAV geo-localization.
arXiv Detail & Related papers (2024-09-25T13:33:28Z) - Weakly-supervised Camera Localization by Ground-to-satellite Image Registration [52.54992898069471]
We propose a weakly supervised learning strategy for ground-to-satellite image registration.
It derives positive and negative satellite images for each ground image.
We also propose a self-supervision strategy for cross-view image relative rotation estimation.
arXiv Detail & Related papers (2024-09-10T12:57:16Z) - GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal modality active geo-localization agent for zero-shot generalization between different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z) - Visual Cross-View Metric Localization with Dense Uncertainty Estimates [11.76638109321532]
This work addresses visual cross-view metric localization for outdoor robotics.
Given a ground-level color image and a satellite patch that contains the local surroundings, the task is to identify the location of the ground camera within the satellite patch.
We devise a novel network architecture with denser satellite descriptors, similarity matching at the bottleneck, and a dense spatial distribution as output to capture multi-modal localization ambiguities.
arXiv Detail & Related papers (2022-08-17T20:12:23Z) - Satellite Image Based Cross-view Localization for Autonomous Vehicle [59.72040418584396]
This paper shows that by using an off-the-shelf high-definition satellite image as a ready-to-use map, we are able to achieve cross-view vehicle localization with satisfactory accuracy.
Our method is validated on KITTI and Ford Multi-AV Seasonal datasets as ground view and Google Maps as the satellite view.
arXiv Detail & Related papers (2022-07-27T13:16:39Z) - Beyond Cross-view Image Retrieval: Highly Accurate Vehicle Localization Using Satellite Image [91.29546868637911]
This paper addresses the problem of vehicle-mounted camera localization by matching a ground-level image with an overhead-view satellite map.
The key idea is to formulate the task as pose estimation and solve it by neural-net based optimization.
Experiments on standard autonomous vehicle localization datasets have confirmed the superiority of the proposed method.
arXiv Detail & Related papers (2022-04-10T19:16:58Z) - Automatic Signboard Detection and Localization in Densely Populated Developing Cities [0.0]
Signboard detection in natural scene images is the foremost task for error-free information retrieval.
We present a novel object detection approach that can detect signboards automatically and is suitable for such cities.
Our proposed method can detect signboards accurately (even if the images contain multiple signboards with diverse shapes and colours in a noisy background), achieving 0.90 mAP (mean average precision).
arXiv Detail & Related papers (2020-03-04T08:04:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.