Related papers: GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

URL: http://arxiv.org/abs/2511.15705v1
Date: Wed, 19 Nov 2025 18:59:22 GMT
Title: GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Authors: Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao,
Abstract summary: Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools.<n>In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses.<n>Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench.<n>We propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related
Score: 53.080882980294795
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista surpasses other open-source agentic models on the geolocalization task greatly and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.

Related papers

SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning [31.665287327579026]
SpotAgent is a framework that formalizes geo-localization into an agentic reasoning process.<n>It actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram.<n>It achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
arXiv Detail & Related papers (2026-02-10T06:57:12Z)
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization [26.98749852286485]
We equip the model textitThinking with Map with agent-in-the-map loop ability and formulate it as an agent-in-the-map loop.<n>We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS)<n>To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images.
arXiv Detail & Related papers (2026-01-08T23:47:30Z)
GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks [15.659980269049798]
Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata.<n>Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes.<n>We propose textbfGraphGeo, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization.
arXiv Detail & Related papers (2025-11-02T11:58:55Z)
Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models [47.98900725310249]
New pipeline constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images.<n>GLOBE incorporates task-specific rewards that jointly enhance localizability assessment, visual-cue reasoning, and geolocation accuracy.<n>Results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks.
arXiv Detail & Related papers (2025-06-17T16:07:58Z)
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [64.86209459039313]
ThinkGeo is an agentic benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning.<n>We implement a ReAct-style interaction loop and evaluate both open and closed-source LLMs on 486 structured agentic tasks with 1,773 expert-verified reasoning steps.<n>Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z)
Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework. By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information. Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks. This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z)
GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities. GOMAA-Geo is a goal modality active geo-localization agent for zero-shot generalization between different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.