Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning
- URL: http://arxiv.org/abs/2510.00072v1
- Date: Mon, 29 Sep 2025 21:34:55 GMT
- Title: Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning
- Authors: Chenhui Xu, Fuxun Yu, Michael J. Bianco, Jacob Kovarskiy, Raphael Tang, Qi Zhang, Zirui Xu, Will LeVine, Brandon Dubbs, Heming Liao, Cassandra Burgess, Suvam Bag, Jay Patravali, Rupanjali Kukal, Mikael Figueroa, Rishi Madhok, Nikolaos Karianakis, Jinjun Xiong
- Abstract summary: We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy.
- Score: 26.869573782008217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models by combining thinking scaffolding and elevating. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars, enabling models to connect visual cues with geographic priors without costly human reasoning annotations. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy. This design supplies a verifiable and scalable reward signal: teaching models to capture and reconcile features across modalities, and harnessing reasoning for accurate prediction. Geo-R1 extends geospatial modeling from domain pretraining / supervised finetuning to reasoning-first post-training, and achieves state-of-the-art performance across various geospatial reasoning benchmarks. Our model is available at https://huggingface.co/miniHui/Geo-R1.
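As a rough illustration of the elevating stage described in the abstract, the sketch below shows how a verifiable pairing reward could feed a GRPO-style group-relative advantage. It is not the authors' implementation: the function names `pairing_reward` and `group_relative_advantages`, the binary reward, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def pairing_reward(predicted_index: int, true_index: int) -> float:
    """Hypothetical verifiable reward: 1.0 when the model picks the
    candidate view that truly matches the query view, else 0.0."""
    return 1.0 if predicted_index == true_index else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of sampled responses.

    Each response is scored relative to the group mean and scaled by the
    group standard deviation (population std is an assumption here).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one cross-view pairing prompt.
# Assume candidate 2 is the true match.
predictions = [2, 0, 2, 3]
rewards = [pairing_reward(p, 2) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Correct responses receive positive advantages and incorrect ones negative advantages, without any learned reward model; when every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.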
Related papers
- OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents [68.85365034738534]
We introduce a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split.
arXiv Detail & Related papers (2026-02-19T18:59:54Z)
- GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics [91.17301794848025]
This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. Previous RL-based methods have achieved breakthroughs in performance and interpretability, but concerns remain because of their reliance on AI-generated chain-of-thought (CoT) data and training strategies.
arXiv Detail & Related papers (2026-02-13T04:48:05Z)
- Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach [41.001581773172695]
We present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference.
arXiv Detail & Related papers (2026-01-01T16:51:41Z)
- On the Impact of Graph Neural Networks in Recommender Systems: A Topological Perspective [49.391877616394765]
In recommender systems, user-item interactions can be modeled as a bipartite graph, where user and item nodes are connected by undirected edges. This graph-based view has motivated the rapid adoption of graph neural networks (GNNs). Despite their empirical success, the reasons why GNNs offer systematic advantages over other approaches remain only partially understood.
arXiv Detail & Related papers (2025-12-08T10:19:43Z)
- GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes [84.52881742231152]
Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. We propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision.
arXiv Detail & Related papers (2025-11-27T17:28:09Z)
- GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization [53.080882980294795]
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench. We propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related
arXiv Detail & Related papers (2025-11-19T18:59:22Z)
- Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models [8.021952962029165]
Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks. We introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT). Geo-CoT is a framework that models remote sensing analysis as a verifiable, multi-step process.
arXiv Detail & Related papers (2025-09-26T11:34:42Z)
- Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning [37.90271368636318]
Referring expression understanding in remote sensing poses unique challenges. We propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines.
arXiv Detail & Related papers (2025-09-26T07:01:12Z)
- GLEAM: Learning to Match and Explain in Cross-View Geo-Localization [66.11208984986813]
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. We present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities (including UAV imagery, street maps, panoramic views, and ground photographs) by aligning them exclusively with satellite imagery. To address the lack of interpretability in traditional CVGL methods, we propose GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning.
arXiv Detail & Related papers (2025-09-09T07:14:31Z)
- Reinforcing Video Reasoning Segmentation to Think Before It Segments [67.5703457389657]
We introduce Veason-R1, a specialized LVLM for video reasoning segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought trajectories. We incorporate a holistic reward mechanism that enhances spatial alignment and temporal consistency. Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins.
arXiv Detail & Related papers (2025-08-15T15:34:56Z)
- RAG for Geoscience: What We Expect, Gaps and Opportunities [15.069356714106808]
Retrieval-Augmented Generation (RAG) enhances language models by combining retrieval with generation. We envision Geo-RAG, a next-generation paradigm that reimagines RAG as a modular retrieve $\rightarrow$ reason $\rightarrow$ generate $\rightarrow$ verify loop. Geo-RAG supports four core capabilities: (i) retrieval of multi-modal Earth data; (ii) reasoning under physical and domain constraints; (iii) generation of science-grade artifacts; and (iv) verification of generated hypotheses against numerical models, ground measurements, and expert assessments.
arXiv Detail & Related papers (2025-08-15T06:33:27Z)
- GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement [4.026524042818433]
GeoSR is a self-refining agentic reasoning framework that embeds core geographic principles into an iterative prediction loop. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction.
arXiv Detail & Related papers (2025-08-06T04:45:34Z)
- EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation [50.433911327489554]
We introduce EarthMapper, a novel framework for controllable satellite-map translation. We also contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities. Experiments on CNSatMap and the New York dataset demonstrate EarthMapper's superior performance.
arXiv Detail & Related papers (2025-04-28T02:41:12Z)
- GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal modality agnostic active geo-localization agent for zero-shot generalization between different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z)
- Assessment of a new GeoAI foundation model for flood inundation mapping [4.312965283062856]
This paper evaluates the performance of the first-of-its-kind geospatial foundation model, IBM-NASA's Prithvi, to support a crucial geospatial analysis task: flood inundation mapping.
A benchmark dataset, Sen1Floods11, is used in the experiments, and the models' predictability, generalizability, and transferability are evaluated.
Results show the good transferability of the Prithvi model, highlighting its performance advantages in segmenting flooded areas in previously unseen regions.
arXiv Detail & Related papers (2023-09-25T19:50:47Z)
- A General Purpose Neural Architecture for Geospatial Systems [142.43454584836812]
We present a roadmap towards the construction of a general-purpose neural architecture (GPNA) with a geospatial inductive bias.
We envision how such a model may facilitate cooperation between members of the community.
arXiv Detail & Related papers (2022-11-04T09:58:57Z)