Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach
- URL: http://arxiv.org/abs/2601.00388v2
- Date: Mon, 05 Jan 2026 18:27:19 GMT
- Title: Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach
- Authors: Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang
- Abstract summary: We present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference.
- Score: 41.001581773172695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.
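The coordinate-aligned reward described in the abstract can be sketched in a few lines. The Haversine formula itself is standard; the exponential reward shaping and the `scale_km` constant below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def coordinate_reward(pred, truth, scale_km=500.0):
    """Map predicted-vs-true distance to a reward in (0, 1].

    pred and truth are (lat, lon) tuples; scale_km is a hypothetical
    shaping constant controlling how quickly reward decays with distance.
    """
    d = haversine_km(pred[0], pred[1], truth[0], truth[1])
    return math.exp(-d / scale_km)
```

A reward of this shape gives dense, spatially meaningful feedback: a prediction a few kilometers off still earns nearly full reward, while a wrong-continent guess earns almost none.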
Related papers
- GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics [91.17301794848025]
This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought (CoT) data and training strategies.
arXiv Detail & Related papers (2026-02-13T04:48:05Z) - RegionReasoner: Region-Grounded Multi-Round Visual Reasoning [69.75509909581133]
RegionReasoner is a reinforcement learning framework for visual reasoning. It enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment.
arXiv Detail & Related papers (2026-02-03T16:52:16Z) - Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework [9.31168320050859]
Geo-localization involves determining the exact geographic location of images captured globally. Current concept-based interpretability methods fail to align effectively with geo-alignment image-location embedding objectives. To our knowledge, this is the first work to introduce interpretability into geo-localization.
arXiv Detail & Related papers (2025-09-02T03:07:26Z) - GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement [4.026524042818433]
GeoSR is a self-refining agentic reasoning framework that embeds core geographic principles into an iterative prediction loop. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction.
arXiv Detail & Related papers (2025-08-06T04:45:34Z) - Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models [47.98900725310249]
A new pipeline constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. GLOBE incorporates task-specific rewards that jointly enhance localizability assessment, visual-cue reasoning, and geolocation accuracy. Results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks.
arXiv Detail & Related papers (2025-06-17T16:07:58Z) - GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains [20.788130896943663]
Geo Reason Enhancement (GRE) Suite is a novel framework that augments Visual Language Models with structured reasoning chains for interpretable location inference. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision.
arXiv Detail & Related papers (2025-05-24T13:48:57Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We provide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z) - Cross-View Visual Geo-Localization for Outdoor Augmented Reality [11.214903134756888]
We address the problem of geo-pose estimation by cross-view matching of query ground images to a geo-referenced aerial satellite image database.
We propose a new transformer neural network-based model and a modified triplet ranking loss for joint location and orientation estimation.
Experiments on several benchmark cross-view geo-localization datasets show that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-03-28T01:58:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.