LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space
- URL: http://arxiv.org/abs/2503.18142v2
- Date: Fri, 31 Oct 2025 21:31:05 GMT
- Title: LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space
- Authors: Zhangyu Wang, Zeping Liu, Jielu Zhang, Zhongliang Zhou, Qian Cao, Nemin Wu, Lan Mu, Yang Song, Yiqun Xie, Ni Lao, Gengchen Mai
- Abstract summary: LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space. We show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets.
- Score: 20.664043071378273
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. State-of-the-art methods employ either grid-based classification or gallery-based image-location retrieval, whose spatial generalizability significantly suffers if the spatial distribution of test images does not align with the choices of grids and galleries. Recently emerging generative approaches, while getting rid of grids and galleries, use raw geographical coordinates and suffer quality losses due to their lack of multi-scale information. To address these limitations, we propose a multi-scale latent diffusion model called LocDiff for image geolocalization. We developed a novel positional encoding-decoding framework called Spherical Harmonics Dirac Delta (SHDD) Representations, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking on spherical probability distributions. We also propose a novel SirenNet-based architecture (CS-UNet) to learn an image-based conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. To the best of our knowledge, LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space and generates geolocations under the guidance of images. Experimental results show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets, and demonstrates significantly stronger generalizability to unseen geolocations.
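The core SHDD idea (encode a location as the spherical-harmonics coefficients of a Dirac delta, decode by mode-seeking on the reconstructed spherical density) can be sketched numerically. This is not the paper's implementation: by the spherical-harmonics addition theorem, the band-limited delta centered at a unit vector u0 reduces to a Legendre series in the dot product u·u0, so the sketch below uses that shortcut; the band limit L=20, the 2-degree candidate grid, and the Paris query point are all illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import legendre

def encode(lat, lon):
    """Map (lat, lon) in degrees to a unit vector on the sphere."""
    la, lo = np.radians(lat), np.radians(lon)
    return np.array([np.cos(la) * np.cos(lo),
                     np.cos(la) * np.sin(lo),
                     np.sin(la)])

def band_limited_delta(u, u0, L=20):
    """Truncated SH expansion of a Dirac delta at u0, evaluated at u.
    Addition theorem: sum_m Y_lm(u0)* Y_lm(u) = (2l+1)/(4*pi) * P_l(u.u0),
    so the expansion collapses to a Legendre series in the dot product."""
    coeffs = (2 * np.arange(L + 1) + 1) / (4 * np.pi)
    return legendre.legval(np.clip(u @ u0, -1.0, 1.0), coeffs)

# Decode by mode-seeking: score a candidate grid and take the argmax.
query = encode(48.8566, 2.3522)          # hypothetical ground truth (Paris)
lats = np.linspace(-89, 89, 90)          # ~2-degree grid, for illustration
lons = np.linspace(-180, 179, 180)
grid = np.array([encode(la, lo) for la in lats for lo in lons])
scores = band_limited_delta(grid, query)
decoded = grid[np.argmax(scores)]        # grid point nearest the true mode
```

The band limit L controls the width of the reconstructed peak, which is the multi-scale knob the abstract alludes to: low L yields a broad, coarse density; high L sharpens the mode toward the exact location.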
Related papers
- Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement [1.6686955491488273]
Cross-view geo-localization (CVGL) aims to establish spatial correspondences between images captured from significantly different viewpoints. CVGL remains challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of discriminative local information. This paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), which leverages complementary representations from the spatial and frequency domains.
arXiv Detail & Related papers (2026-03-03T08:25:35Z) - HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation [12.392226207474662]
We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning, which incorporates the haversine distance into the contrastive objective. Compared to current methods in the literature, it reduces mean geodesic error by 19.5% while improving fine-grained subregion accuracy by 43%.
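The haversine distance mentioned above is the standard great-circle distance between two latitude/longitude points. A minimal sketch, assuming a mean Earth radius of 6371 km; the `geo_weight` function is purely hypothetical (HierLoc's exact geo-weighting form is not given in this summary) and only illustrates how a distance could modulate a contrastive term:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = p2 - p1
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def geo_weight(dist_km, tau_km=1000.0):
    """Hypothetical weighting: geographically close pairs get smaller weight."""
    return 1.0 - np.exp(-dist_km / tau_km)
```

For example, `haversine_km(48.8566, 2.3522, 51.5074, -0.1278)` (Paris to London) is roughly 343 km, so a geo-weighted objective would penalize confusing those two cities less than confusing two antipodal ones.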
arXiv Detail & Related papers (2026-01-30T15:16:07Z) - Scaling Image Geo-Localization to Continent Level [48.7766435870634]
This paper introduces a hybrid approach that achieves fine-grained geo-localization across a geographic expanse the size of a continent. We leverage a proxy classification task during training to learn rich feature representations that implicitly encode precise location information. Our evaluation demonstrates that our approach localizes more than 68% of queries to within 200 m on a dataset covering a large part of Europe.
arXiv Detail & Related papers (2025-10-30T17:59:35Z) - Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework [9.31168320050859]
Geo-localization involves determining the exact geographic location of images captured globally. Current concept-based interpretability methods fail to align effectively with geo-alignment image-location embedding objectives. To our knowledge, this is the first work to introduce interpretability into geo-localization.
arXiv Detail & Related papers (2025-09-02T03:07:26Z) - GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization [70.65458151146767]
Cross-view localization is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. We propose GeoDistill, a framework that uses teacher-student learning with Field-of-View (FoV)-based masking.
arXiv Detail & Related papers (2025-07-15T03:00:15Z) - Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation [19.028122299569052]
Global visual geolocation predicts where an image was captured on Earth. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. Our model achieves state-of-the-art performance on three visual geolocation benchmarks.
arXiv Detail & Related papers (2024-12-09T18:59:04Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - Enhancing Worldwide Image Geolocation by Ensembling Satellite-Based Ground-Level Attribute Predictors [4.415977307120618]
We examine the challenge of estimating the location of a single ground-level image in the absence of GPS or other location metadata.
We introduce a novel metric, Recall vs Area, which measures the accuracy of estimated distributions of locations.
We then examine an ensembling approach to global-scale image geolocation, which incorporates information from multiple sources.
arXiv Detail & Related papers (2024-07-18T19:15:52Z) - GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
arXiv Detail & Related papers (2023-09-27T20:54:56Z) - G^3: Geolocation via Guidebook Grounding [92.46774241823562]
We study explicit knowledge from human-written guidebooks that describe the salient and class-discriminative visual features humans use for geolocation.
We propose the task of Geolocation via Guidebook Grounding that uses a dataset of StreetView images from a diverse set of locations.
Our approach substantially outperforms a state-of-the-art image-only geolocation method, with an improvement of over 5% in Top-1 accuracy.
arXiv Detail & Related papers (2022-11-28T16:34:40Z) - Accurate 3-DoF Camera Geo-Localization via Ground-to-Satellite Image Matching [102.39635336450262]
We address the problem of ground-to-satellite image geo-localization by matching a query image captured at the ground level against a large-scale database with geotagged satellite images.
Our new method is able to achieve the fine-grained location of a query image, up to the pixel-size precision of the satellite image.
arXiv Detail & Related papers (2022-03-26T20:10:38Z) - Low-Rank Subspaces in GANs [101.48350547067628]
This work introduces low-rank subspaces that enable more precise control of GAN generation.
LowRankGAN is able to find the low-dimensional representation of attribute manifold.
Experiments on state-of-the-art GAN models (including StyleGAN2 and BigGAN) trained on various datasets demonstrate the effectiveness of our LowRankGAN.
arXiv Detail & Related papers (2021-06-08T16:16:32Z) - Hierarchical Attention Fusion for Geo-Localization [7.544917072241684]
We introduce a hierarchical attention fusion network using multi-scale features for geo-localization.
We extract the hierarchical feature maps from a convolutional neural network (CNN) and organically fuse the extracted features for image representations.
Our training is self-supervised using adaptive weights to control the attention of feature emphasis from each hierarchical level.
arXiv Detail & Related papers (2021-02-18T07:07:03Z) - Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization [54.00111565818903]
Cross-view geo-localization is to spot images of the same geographic target from different platforms.
Existing methods usually concentrate on mining the fine-grained feature of the geographic target in the image center.
We introduce a simple and effective deep neural network, called Local Pattern Network (LPN), to take advantage of contextual information.
arXiv Detail & Related papers (2020-08-26T16:06:11Z) - Zero-Shot Multi-View Indoor Localization via Graph Location Networks [66.05980368549928]
Indoor localization is a fundamental problem in location-based applications.
We propose a novel neural network based architecture Graph Location Networks (GLN) to perform infrastructure-free, multi-view image based indoor localization.
GLN makes location predictions based on robust location representations extracted from images through message-passing networks.
We introduce a novel zero-shot indoor localization setting and tackle it by extending the proposed GLN to a dedicated zero-shot version.
arXiv Detail & Related papers (2020-08-06T07:36:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.