Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
- URL: http://arxiv.org/abs/2601.05432v1
- Date: Thu, 08 Jan 2026 23:47:30 GMT
- Title: Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
- Authors: Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu,
- Abstract summary: We equip the model with the "Thinking with Map" ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it: agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). To evaluate our method on up-to-date, in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images.
- Score: 26.98749852286485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the "Thinking with Map" ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, consisting of agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL stage strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date, in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
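Two pieces of the abstract can be made concrete with a small sketch: the Acc@500m metric (fraction of predictions within 500 m of the ground truth, using great-circle distance) and the parallel-TTS idea of aggregating multiple candidate predictions into one final guess. The paper does not specify its aggregation rule; the geometric-medoid vote below is a hypothetical stand-in, not the authors' implementation.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in meters between two (lat, lon) points.
    r = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def medoid_vote(candidates):
    # Aggregate parallel rollouts: pick the candidate with the smallest
    # total distance to all other candidates (a simple consensus rule;
    # the paper's actual TTS aggregation may differ).
    return min(
        candidates,
        key=lambda c: sum(haversine_m(*c, *o) for o in candidates),
    )

def acc_at(preds, truths, threshold_m):
    # Acc@K: fraction of predictions within threshold_m of ground truth.
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(preds, truths))
    return hits / len(preds)
```

With this rule, an outlier rollout far from the cluster of agreeing candidates is voted down, which is one plausible reason exploring multiple paths helps at fine thresholds like 500 m.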
Related papers
- GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization [53.080882980294795]
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench. We propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related information.
arXiv Detail & Related papers (2025-11-19T18:59:22Z) - GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI [52.13138825802668]
GeoFMs are transforming Earth Observation, but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
arXiv Detail & Related papers (2025-11-19T17:45:02Z) - GeoToken: Hierarchical Geolocalization of Images via Next Token Prediction [23.767061975974134]
We propose a hierarchical sequence prediction approach inspired by how humans narrow down locations from broad regions to specific addresses. Our method uses S2 cells, a nested, multiresolution global grid, and sequentially predicts finer-level cells conditioned on visual inputs and previous predictions. We evaluate our method on the Im2GPS3k and YFCC4k datasets against two distinct sets of baselines.
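The coarse-to-fine idea in this abstract can be sketched without the S2 library: each predicted token selects one quadrant of the current cell, so a token sequence decodes to an ever-smaller region. The lat/lon quadtree below is a simplified stand-in for S2's actual sphere-based cell hierarchy, not the paper's implementation.

```python
def refine(cell, quadrant):
    # cell = (lat_min, lat_max, lon_min, lon_max); quadrant in 0..3
    # selects one of the four sub-rectangles of the current cell.
    lat_min, lat_max, lon_min, lon_max = cell
    lat_mid = (lat_min + lat_max) / 2
    lon_mid = (lon_min + lon_max) / 2
    lat_half = (lat_min, lat_mid) if quadrant in (0, 1) else (lat_mid, lat_max)
    lon_half = (lon_min, lon_mid) if quadrant in (0, 2) else (lon_mid, lon_max)
    return (*lat_half, *lon_half)

def decode(token_seq):
    # Start from the whole globe; each "next token" halves the cell
    # in both latitude and longitude, mirroring hierarchical prediction.
    cell = (-90.0, 90.0, -180.0, 180.0)
    for q in token_seq:
        cell = refine(cell, q)
    return cell
```

After k tokens the cell area shrinks by a factor of 4^k, which is why a short token sequence can localize from continent scale down to street scale.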
arXiv Detail & Related papers (2025-11-02T21:30:06Z) - Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z) - GLEAM: Learning to Match and Explain in Cross-View Geo-Localization [66.11208984986813]
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. We present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities, including UAV imagery, street maps, panoramic views, and ground photographs, by aligning them exclusively with satellite imagery. To address the lack of interpretability in traditional CVGL methods, we propose GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning.
arXiv Detail & Related papers (2025-09-09T07:14:31Z) - GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization [21.941170274245223]
Image geolocalization aims to predict the geographic location of images captured anywhere on Earth. Current evaluation methodologies suffer from two major limitations. We propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks.
arXiv Detail & Related papers (2025-09-04T15:52:04Z) - OSMLoc: Single Image-Based Visual Localization in OpenStreetMap with Fused Geometric and Semantic Guidance [20.043977909592115]
OSMLoc is a brain-inspired visual localization approach that matches first-person-view images against OpenStreetMap data. It integrates semantic and geometric guidance to significantly improve accuracy, robustness, and generalization capability.
arXiv Detail & Related papers (2024-11-13T14:59:00Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z) - Image-based Geolocalization by Ground-to-2.5D Map Matching [21.21416396311102]
Methods often utilize cross-view localization techniques to match ground-view query images with 2D maps.
We propose a new approach to learning representative embeddings from multi-modal data.
By encoding crucial geometric cues, our method learns discriminative location embeddings for matching panoramic images and maps.
arXiv Detail & Related papers (2023-08-11T08:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.