CV-Cities: Advancing Cross-View Geo-Localization in Global Cities
- URL: http://arxiv.org/abs/2411.12431v1
- Date: Tue, 19 Nov 2024 11:41:22 GMT
- Title: CV-Cities: Advancing Cross-View Geo-Localization in Global Cities
- Authors: Gaoshuang Huang, Yang Zhou, Luying Zhao, Wenjian Gan,
- Abstract summary: Cross-view geo-localization (CVGL) involves matching and retrieving satellite images to determine the geographic location of a ground image.
This task faces significant challenges due to substantial viewpoint discrepancies, the complexity of localization scenarios, and the need for global localization.
We propose a novel CVGL framework that integrates the foundational model DINOv2 with an advanced feature mixer.
- Score: 3.074201632920997
- License:
- Abstract: Cross-view geo-localization (CVGL), which involves matching and retrieving satellite images to determine the geographic location of a ground image, is crucial in GNSS-constrained scenarios. However, this task faces significant challenges due to substantial viewpoint discrepancies, the complexity of localization scenarios, and the need for global localization. To address these issues, we propose a novel CVGL framework that integrates the vision foundational model DINOv2 with an advanced feature mixer. Our framework introduces the symmetric InfoNCE loss and incorporates near-neighbor sampling and dynamic similarity sampling strategies, significantly enhancing localization accuracy. Experimental results show that our framework surpasses existing methods across multiple public and self-built datasets. To further improve globalscale performance, we have developed CV-Cities, a novel dataset for global CVGL. CV-Cities includes 223,736 ground-satellite image pairs with geolocation data, spanning sixteen cities across six continents and covering a wide range of complex scenarios, providing a challenging benchmark for CVGL. The framework trained with CV-Cities demonstrates high localization accuracy in various test cities, highlighting its strong globalization and generalization capabilities. Our datasets and codes are available at https://github.com/GaoShuang98/CVCities.
Related papers
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components.
GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric.
We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z) - Cross-View Geo-Localization with Street-View and VHR Satellite Imagery in Decentrality Settings [39.252555758596706]
Cross-View Geo-Localization matches street-view query images with geo-tagged aerial-view reference images.
Decentrality is a critical factor warranting deeper investigation, as larger decentrality can substantially improve localization efficiency but comes at the cost of declines in localization accuracy.
We introduce DReSS, a novel dataset designed to evaluate cross-view geo-localization with a large geographic scope and diverse landscapes.
arXiv Detail & Related papers (2024-12-16T08:07:53Z) - World-Consistent Data Generation for Vision-and-Language Navigation [52.08816337783936]
Vision-and-Language Navigation (VLN) is a challenging task that requires an agent to navigate through photorealistic environments following natural-language instructions.
One main obstacle existing in VLN is data scarcity, leading to poor generalization performance over unseen environments.
We propose the world-consistent data generation (WCGEN), an efficacious data-augmentation framework satisfying both diversity and world-consistency.
arXiv Detail & Related papers (2024-12-09T11:40:54Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery [22.716322265391852]
We introduce Satellite Contrastive Location-Image Pretraining (SatCLIP)
SatCLIP learns an implicit representation of locations by matching CNN and ViT inferred visual patterns of openly available satellite imagery with their geographic coordinates.
In experiments, we use SatCLIP embeddings to improve prediction performance on nine diverse location-dependent tasks including temperature prediction, animal recognition, and population density estimation.
arXiv Detail & Related papers (2023-11-28T19:14:40Z) - CurriculumLoc: Enhancing Cross-Domain Geolocalization through
Multi-Stage Refinement [11.108860387261508]
Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images taken at some unknown location, to a set of geo-tagged reference images.
We develop CurriculumLoc, a novel keypoint detection and description with global semantic awareness and a local geometric verification.
We achieve new high recall@1 scores of 62.6% and 94.5% on ALTO, with two different distances metrics, respectively.
arXiv Detail & Related papers (2023-11-20T08:40:01Z) - GeoCLIP: Clip-Inspired Alignment between Locations and Images for
Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
arXiv Detail & Related papers (2023-09-27T20:54:56Z) - Think Global, Act Local: Dual-scale Graph Transformer for
Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z) - PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image
Segmentation [87.50205728818601]
We propose a PriorGuided Local (PGL) self-supervised model that learns the region-wise local consistency in the latent feature space.
Our PGL model learns the distinctive representations of local regions, and hence is able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.