GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
- URL: http://arxiv.org/abs/2512.02715v1
- Date: Tue, 02 Dec 2025 12:45:52 GMT
- Title: GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
- Authors: Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo, Fengxiang Wang, Xue Yang, Kaiwen Wei, Lei Wang,
- Abstract summary: We propose GeoViS, a progressive search-and-reasoning framework for remote sensing visual grounding. We show that GeoViS consistently surpasses existing methods across key visual grounding metrics.
- Score: 23.253648429114236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.
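To make the search-and-reasoning idea concrete, here is a minimal sketch of a reward-guided, tree-structured search over image regions. It only illustrates the general pattern the abstract describes; the 2x2 region-splitting scheme, the `reward_fn` interface, and the stopping rule are assumptions, not the actual GeoViS algorithm.

```python
# Illustrative sketch only: best-first, tree-structured search over image
# regions guided by a reward function. Splitting scheme, reward interface,
# and stopping rule are assumptions, not the actual GeoViS algorithm.
import heapq

def split_region(box):
    """Split a region (x0, y0, x1, y1) into a 2x2 grid of candidate crops."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]

def reward_guided_search(reward_fn, image_box, max_steps=64, min_size=64.0):
    """Expand the highest-reward region until a crop is small enough to emit
    as a grounding box. In GeoViS the reward would come from multimodal (MLLM)
    scoring of the crop against the query plus geospatial-relation checks;
    here it is just a callable box -> float."""
    frontier = [(-reward_fn(image_box), image_box)]
    best = image_box
    for _ in range(max_steps):
        if not frontier:
            break
        _, box = heapq.heappop(frontier)
        best = box
        if min(box[2] - box[0], box[3] - box[1]) <= min_size:
            break  # fine-grained enough: return this region as the hypothesis
        for child in split_region(box):
            heapq.heappush(frontier, (-reward_fn(child), child))
    return best

# Toy usage: a fake reward that prefers crops centred near a fictitious target.
target = (900.0, 450.0)
toy_reward = lambda b: -abs((b[0] + b[2]) / 2 - target[0]) - abs((b[1] + b[3]) / 2 - target[1])
print(reward_guided_search(toy_reward, (0.0, 0.0, 4096.0, 4096.0)))
```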
Related papers
- Thinking with Geometry: Active Geometry Integration for Spatial Reasoning [68.59084007360615]
We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.
arXiv Detail & Related papers (2026-02-05T18:59:32Z)
- GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization [53.080882980294795]
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench. We propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related
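As a rough illustration of how such a tool-augmented loop can be wired up, here is a hedged sketch; the message format, the `zoom_in` / `web_search` tool names, and the model interface are assumptions for illustration, not GeoVista's actual implementation.

```python
# Illustrative sketch only: a generic tool-augmented reasoning loop with a
# zoom-in tool and a web-search tool. Message format, tool names, and model
# interface are assumptions, not GeoVista's actual implementation.
from typing import Callable, Dict, List

def agentic_geolocalize(model: Callable[[List[dict]], dict], image, question: str,
                        tools: Dict[str, Callable], max_turns: int = 8):
    """Alternate between model reasoning and tool calls until the model answers.

    `model(messages)` is assumed to return either
      {"tool": name, "args": {...}}  -- a tool-call request, or
      {"answer": ...}                -- a final geolocation hypothesis."""
    messages = [{"role": "user", "image": image, "text": question}]
    for _ in range(max_turns):
        step = model(messages)
        if "answer" in step:
            return step["answer"]
        observation = tools[step["tool"]](**step["args"])
        messages.append({"role": "tool", "name": step["tool"], "content": observation})
    return None  # no confident answer within the turn budget

# Stub tools; real ones would crop image pixels or query a search API.
example_tools = {
    "zoom_in": lambda box: f"cropped region {box}",
    "web_search": lambda query: f"search results for {query!r}",
}
```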
arXiv Detail & Related papers (2025-11-19T18:59:22Z)
- GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration [24.01750902074338]
Active Geo-localization (AGL) is the task of localizing a goal within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning problem with a distance-based reward. We propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards.
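One standard way to realize curiosity-driven intrinsic rewards is the prediction error of a learned forward model; the sketch below shows that generic recipe and is only an assumption about the flavor of reward, not GeoExplorer's exact formulation.

```python
# Illustrative sketch only: curiosity-style intrinsic reward computed as the
# prediction error of a learned forward model; GeoExplorer's exact
# formulation may differ.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next observation embedding from the current one plus action."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs_emb, action):
        return self.net(torch.cat([obs_emb, action], dim=-1))

def intrinsic_reward(forward_model, obs_emb, action, next_obs_emb):
    """Reward the agent for transitions the forward model predicts poorly."""
    with torch.no_grad():
        pred = forward_model(obs_emb, action)
    return 0.5 * (pred - next_obs_emb).pow(2).sum(dim=-1)
```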
arXiv Detail & Related papers (2025-07-31T20:23:25Z)
- Dynamic Contrastive Learning for Hierarchical Retrieval: A Case Study of Distance-Aware Cross-View Geo-Localization [20.868592923432843]
Existing deep learning-based cross-view geo-localization methods primarily focus on improving the accuracy of cross-domain image matching. We propose Dynamic Contrastive Learning (DyCL), a novel framework that progressively aligns feature representations according to hierarchical spatial margins.
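A simple way to picture "hierarchical spatial margins" is a triplet-style loss whose margin grows with the geographic distance of the negative; the sketch below uses an assumed log-distance margin schedule and is not DyCL's exact objective.

```python
# Illustrative sketch only: a triplet-style loss whose margin grows with the
# geographic distance of each negative, so far-away negatives must be pushed
# further apart than nearby ones. The log-distance margin schedule is an
# assumed stand-in, not DyCL's exact objective.
import torch
import torch.nn.functional as F

def distance_aware_triplet(q, pos, neg, geo_dist_neg_km, base_margin=0.2, scale=0.05):
    """q, pos, neg: embedding tensors; geo_dist_neg_km: distance to each negative."""
    margin = base_margin + scale * torch.log1p(geo_dist_neg_km)  # larger margin for distant negatives
    d_pos = 1.0 - F.cosine_similarity(q, pos, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(q, neg, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```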
arXiv Detail & Related papers (2025-06-29T03:57:01Z)
- GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations [5.439918212297177]
A key challenge in multimodal GeoFM development is to explicitly model geospatial relationships across modalities. We propose GAIR, a novel multimodal GeoFM architecture integrating overhead RS data, street view imagery, and their geolocation metadata. We evaluate GAIR across 10 geospatial tasks spanning RS image-based, SV image-based, and location embedding-based benchmarks.
arXiv Detail & Related papers (2025-03-20T19:59:39Z)
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric. We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z)
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z)
- Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
Through inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
- GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal modality agnostic active geo-localization agent for zero-shot generalization between different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z)
- CurriculumLoc: Enhancing Cross-Domain Geolocalization through Multi-Stage Refinement [11.108860387261508]
Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images taken at some unknown location to a set of geo-tagged reference images.
We develop CurriculumLoc, a novel keypoint detection and description method with global semantic awareness and local geometric verification.
We achieve new high recall@1 scores of 62.6% and 94.5% on ALTO under two different distance metrics, respectively.
arXiv Detail & Related papers (2023-11-20T08:40:01Z)
- Cross-View Visual Geo-Localization for Outdoor Augmented Reality [11.214903134756888]
We address the problem of geo-pose estimation by cross-view matching of query ground images to a geo-referenced aerial satellite image database.
We propose a new transformer neural network-based model and a modified triplet ranking loss for joint location and orientation estimation.
Experiments on several benchmark cross-view geo-localization datasets show that our model achieves state-of-the-art performance.
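For intuition, a joint location-and-orientation objective of the kind mentioned above can be sketched as a triplet ranking term plus an orientation-alignment term; the weighting and the wrap-aware angle loss below are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: a joint objective combining a triplet ranking term
# for cross-view retrieval with an orientation-alignment term. Weighting and
# the wrap-aware angle loss are assumptions, not the paper's formulation.
import torch
import torch.nn.functional as F

def joint_loc_orient_loss(g, a_pos, a_neg, pred_yaw, gt_yaw,
                          margin=0.3, orient_weight=1.0):
    """g: ground-image embeddings; a_pos / a_neg: matching and non-matching
    aerial embeddings; pred_yaw / gt_yaw: headings in radians."""
    d_pos = 1.0 - F.cosine_similarity(g, a_pos, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(g, a_neg, dim=-1)
    ranking = F.relu(d_pos - d_neg + margin).mean()            # location retrieval term
    orientation = (1.0 - torch.cos(pred_yaw - gt_yaw)).mean()  # wrap-aware heading error
    return ranking + orient_weight * orientation
```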
arXiv Detail & Related papers (2023-03-28T01:58:03Z)