GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes
- URL: http://arxiv.org/abs/2511.22645v1
- Date: Thu, 27 Nov 2025 17:28:09 GMT
- Title: GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes
- Authors: Di Wang, Shunyu Liu, Wentao Jiang, Fengxiang Wang, Yi Liu, Xiaolei Qin, Zhiming Luo, Chaoyang Zhou, Haonan Guo, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
- Abstract summary: Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. We propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. However, this approach not only incurs substantial annotation costs but also introduces human biases that may limit the diversity of model reasoning. To address these challenges, we propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision. Specifically, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. Furthermore, we introduce Answer-Anchored Group Relative Policy Optimization (A$^2$GRPO), where the reasoning process is regularized by the model's own answers, encouraging diverse yet accurate thinking. Extensive experiments on multiple remote sensing vision-language benchmarks demonstrate that GeoZero not only surpasses existing state-of-the-art methods but also fosters universal emergent reasoning capabilities across diverse geospatial tasks. Code, data, and models will be publicly available at https://github.com/MiliLab/GeoZero.
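The abstract names A$^2$GRPO but does not spell out its update rule, so the following is a minimal, hypothetical sketch of what an answer-anchored group-relative advantage might look like: standard GRPO normalization of rewards within a sampled group of responses, plus an extra reward term that scores each reasoning trace for consistency with that response's own final answer. The `<answer>` tag format, the `alpha` weight, and the `consistency_score` stub are illustrative assumptions, not details from the paper.

```python
import re
from statistics import mean, pstdev

def extract_answer(response: str) -> str:
    # Assumed response format: reasoning text followed by <answer>...</answer>.
    m = re.search(r"<answer>(.*?)</answer>", response, flags=re.S)
    return m.group(1).strip() if m else ""

def consistency_score(reasoning: str, answer: str) -> float:
    # Hypothetical stub: how well the reasoning trace supports the final
    # answer. In practice this could be a rule-based check or a learned scorer.
    return 1.0 if answer and answer.lower() in reasoning.lower() else 0.0

def a2grpo_advantages(responses: list[str], gold_answer: str,
                      alpha: float = 0.5) -> list[float]:
    # reward_i = accuracy_i + alpha * consistency(reasoning_i, answer_i),
    # followed by GRPO-style group-relative normalization of the rewards.
    rewards = []
    for resp in responses:
        answer = extract_answer(resp)
        accuracy = 1.0 if answer == gold_answer else 0.0
        rewards.append(accuracy + alpha * consistency_score(resp, answer))
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]
```

Anchoring the consistency term to the model's own answer, rather than to a human-written CoT, is what would let such a reward shape the reasoning without prescribing its content, matching the abstract's stated goal of diverse yet accurate thinking.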
Related papers
- UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes [18.631940492768898]
We introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. We also present UniGeoSeg, a unified framework that incorporates task-aware text enhancement, latent knowledge memory, and a progressive training strategy.
arXiv Detail & Related papers (2025-11-28T16:40:08Z) - GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization [53.080882980294795]
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench. We propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related information (a minimal sketch of such a tool loop appears after this list).
arXiv Detail & Related papers (2025-11-19T18:59:22Z) - Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning [26.869573782008217]
We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy.
arXiv Detail & Related papers (2025-09-29T21:34:55Z) - Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning [37.90271368636318]
Referring expression understanding in remote sensing poses unique challenges. We propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring.
arXiv Detail & Related papers (2025-09-26T07:01:12Z) - TurnBack: A Geospatial Route Cognition Benchmark for Large Language Models through Reverse Route [45.16008377814563]
We create a large-scale evaluation dataset comprising 36,000 routes from 12 metropolises worldwide. We introduce PathBuilder, a novel tool for converting natural language instructions into navigation routes. We rigorously assess 11 state-of-the-art (SOTA) LLMs on the task of route reversal.
arXiv Detail & Related papers (2025-09-17T15:00:03Z) - GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains [20.788130896943663]
Geo Reason Enhancement (GRE) Suite is a novel framework that augments Visual Language Models with structured reasoning chains for interpretable location inference. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision.
arXiv Detail & Related papers (2025-05-24T13:48:57Z) - OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence [51.0456395687016]
Multimodal large language models (MLLMs) have opened new frontiers in artificial intelligence. We propose an MLLM (OmniGeo) tailored to geospatial applications. By combining the strengths of natural language understanding and spatial reasoning, our model enhances instruction following and the accuracy of GeoAI systems.
arXiv Detail & Related papers (2025-03-20T16:45:48Z) - GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal modality agnostic active geo-localization agent for zero-shot generalization between different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z) - GeoGalactica: A Scientific Large Language Model in Geoscience [95.15911521220052]
Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP).
We specialize an LLM for geoscience by further pre-training the model on a vast amount of geoscience texts, and by supervised fine-tuning (SFT) the resulting model with our custom-collected instruction-tuning dataset.
We train GeoGalactica on a geoscience text corpus containing 65 billion tokens, the largest geoscience-specific text corpus to date.
Then we fine-tune the model with 1 million pairs of instruction-tuning data.
arXiv Detail & Related papers (2023-12-31T09:22:54Z) - GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z)
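The GeoVista entry above describes tool invocation inside the reasoning loop; since that paper's actual interface is not given here, the sketch below is a generic, hypothetical tool loop in which the model emits tagged tool calls (`zoom_in`, `web_search`) that are executed and fed back as observations. All names, tags, and the `model.generate` interface are illustrative assumptions, not GeoVista's actual API.

```python
import re

def zoom_in(image, box):
    # Hypothetical stub: crop and magnify a region (x1, y1, x2, y2).
    return f"zoomed view of {box}"

def web_search(query):
    # Hypothetical stub: retrieve a short snippet for the query.
    return f"search results for: {query}"

def agentic_geolocalize(model, image, max_steps=8):
    # Generic agentic loop: generate, detect a tool call, execute it,
    # append the observation, and repeat until a final answer is produced.
    context = [{"role": "user", "content": "Where was this photo taken?"}]
    reply = ""
    for _ in range(max_steps):
        reply = model.generate(context, image=image)  # assumed interface
        context.append({"role": "assistant", "content": reply})
        call = re.search(r"<tool>(\w+)\s*(.*?)</tool>", reply, flags=re.S)
        if call is None:
            return reply  # no tool requested: treat as the final answer
        name, arg = call.group(1), call.group(2).strip()
        if name == "zoom_in":
            observation = zoom_in(image, [int(v) for v in arg.split(",")])
        elif name == "web_search":
            observation = web_search(arg)
        else:
            observation = f"unknown tool: {name}"
        context.append({"role": "tool", "content": str(observation)})
    return reply  # step budget exhausted; return the last reply
```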