Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
- URL: http://arxiv.org/abs/2509.21976v1
- Date: Fri, 26 Sep 2025 07:01:12 GMT
- Title: Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
- Authors: Zilun Zhang, Zian Guan, Tiancheng Zhao, Haozhan Shen, Tianyu Li, Yuxiang Cai, Zhonggen Su, Zhaojun Liu, Jianwei Yin, Xiang Li,
- Abstract summary: Referring expression understanding in remote sensing poses unique challenges. We propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines.
- Score: 37.90271368636318
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While supervised fine-tuning (SFT) of multimodal large language models achieves strong performance with massive labeled datasets, it struggles in data-scarce scenarios, leading to poor generalization. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 requires the model to first generate explicit, interpretable reasoning chains that decompose the referring expression, and then to leverage these rationales to localize target objects. This "reason first, then act" process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at http://geo-r1.github.io.
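The "reason first, then act" output format described above can be illustrated with a minimal parsing sketch. The `<think>`/`<answer>` tag names and the bounding-box answer format are assumptions for illustration; the paper does not specify its exact output schema.

```python
import re

def parse_reason_then_act(output: str):
    """Split a 'reason first, then act' response into its rationale and the
    final bounding box. Tag names are hypothetical, not from the paper."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>\s*\[([\d.,\s]+)\]\s*</answer>", output, re.DOTALL)
    rationale = think.group(1).strip() if think else ""
    box = [float(v) for v in answer.group(1).split(",")] if answer else None
    return rationale, box

# Example model output: an explicit reasoning chain followed by a localization.
out = ("<think>The expression refers to the plane nearest the terminal; "
       "I first locate the terminal, then pick the closest aircraft.</think>"
       "<answer>[120, 45, 180, 90]</answer>")
rationale, box = parse_reason_then_act(out)
```

Separating the rationale from the answer like this also makes it straightforward to score the two parts with different rewards during RFT.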
Related papers
- GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery [12.65874706732698]
We present GeoSeg, a zero-shot, training-free framework that bypasses the supervision bottleneck for reasoning-driven remote sensing segmentation. GeoSeg couples MLLM reasoning with precise localization via: (i) bias-aware coordinate refinement to correct systematic grounding shifts and (ii) a dual-route prompting mechanism to fuse semantic intent with fine-grained spatial cues. Experiments show that GeoSeg consistently outperforms all baselines, with extensive ablations confirming the effectiveness and necessity of each component.
arXiv Detail & Related papers (2026-03-04T12:24:16Z) - Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning [52.075928878249066]
Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. We introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language representations. We propose GeoDPO, a translator-guided reinforcement learning framework.
arXiv Detail & Related papers (2026-02-26T07:28:04Z) - RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z) - SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images [49.52402091341301]
Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios. We present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS.
arXiv Detail & Related papers (2025-12-23T03:10:17Z) - GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding [14.436063587920005]
We introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. It achieves significant gains in image captioning, visual grounding, and multi-object detection. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.
arXiv Detail & Related papers (2025-12-02T07:59:46Z) - UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes [18.631940492768898]
We introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. We also present UniGeoSeg, a unified framework that incorporates task-aware text enhancement, latent knowledge memory, and a progressive training strategy.
arXiv Detail & Related papers (2025-11-28T16:40:08Z) - GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes [84.52881742231152]
Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. We propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision.
arXiv Detail & Related papers (2025-11-27T17:28:09Z) - Geo-R1: Unlocking VLM Geospatial Reasoning with Cross-View Reinforcement Learning [26.869573782008217]
We introduce Geo-R1, a reasoning-centric post-training framework that unlocks geospatial reasoning in vision-language models. In the scaffolding stage, Geo-R1 instills a "geospatial thinking paradigm" via supervised fine-tuning on synthetic chain-of-thought exemplars. In the elevating stage, it uses GRPO-based reinforcement learning on a weakly-supervised cross-view pairing proxy.
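The GRPO-based elevating stage mentioned above relies on group-relative advantages: several completions are sampled per prompt and each reward is normalized against the group's statistics, with no learned value network. A minimal sketch of that normalization step (the sampling, policy update, and proxy reward are omitted):

```python
from statistics import mean, stdev

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: normalize each sampled
    completion's reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma or 1.0  # guard against zero std when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Four sampled completions for one prompt: two scored 1.0, two scored 0.0.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions above the group mean receive positive advantages and are reinforced; those below are suppressed, which is what lets a weak pairing proxy provide a usable learning signal.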
arXiv Detail & Related papers (2025-09-29T21:34:55Z) - Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models [8.021952962029165]
Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks. We introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT). Geo-CoT is a framework that models remote sensing analysis as a verifiable, multi-step process.
arXiv Detail & Related papers (2025-09-26T11:34:42Z) - GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions [45.70578816057097]
We introduce the task of Referring Expression Comprehension (REC) for geometric problems. REC evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We generate a large-scale synthetic training dataset using a structured geometric formal language.
arXiv Detail & Related papers (2025-09-25T12:00:52Z) - GeoSR: Cognitive-Agentic Framework for Probing Geospatial Knowledge Boundaries via Iterative Self-Refinement [4.026524042818433]
GeoSR is a self-refining agentic reasoning framework that embeds core geographic principles into an iterative prediction loop. We validate GeoSR on tasks ranging from physical-world property estimation to socioeconomic prediction.
arXiv Detail & Related papers (2025-08-06T04:45:34Z) - TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving [106.04001249574786]
TrustGeoGen is a data engine that generates formally verified geometric problems to establish a principled and trustworthy benchmark. Our engine integrates four key innovations: 1) Multimodal Alignment, which synchronizes the generation of diagrams, text, and step-by-step solutions; 2) Formal Verification, ensuring all reasoning paths are rule-compliant; 3) Connection Thinking, bridging formal deduction with human-like logical steps; and 4) our GeoExplore series algorithms, which produce diverse problem variants with multiple solutions and self-reflective backtracking.
arXiv Detail & Related papers (2025-04-22T10:45:23Z) - Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants -- Hard Recall, Hard Recall+Relax, and Soft Recall.
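The recall-based reward variants named above can be sketched for scene-graph triplets. The exact matching rules of R1-SGG are not given in the abstract, so the "soft" scoring below (per-slot agreement over subject, predicate, object) is an illustrative assumption, not the paper's definition:

```python
def hard_recall(pred, gold):
    """Fraction of ground-truth triplets reproduced exactly by the model."""
    return sum(1 for t in gold if t in pred) / len(gold) if gold else 0.0

def soft_recall(pred, gold):
    """Relaxed variant (hypothetical): credit each gold triplet with the best
    fraction of agreeing (subject, predicate, object) slots over predictions."""
    if not gold:
        return 0.0
    total = 0.0
    for g in gold:
        best = max((sum(a == b for a, b in zip(g, p)) / 3 for p in pred),
                   default=0.0)
        total += best
    return total / len(gold)

pred = [("dog", "on", "grass"), ("cat", "near", "tree")]
gold = [("dog", "on", "grass"), ("cat", "under", "tree")]
```

Here the hard variant scores 0.5 (one exact match), while the soft variant also grants partial credit for the near-miss predicate, yielding a denser reward signal for RL.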
arXiv Detail & Related papers (2025-04-18T10:46:22Z) - SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model [61.97017867656831]
We introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. We construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods.
arXiv Detail & Related papers (2025-04-13T16:36:47Z) - GeoAggregator: An Efficient Transformer Model for Geo-Spatial Tabular Data [5.40483645224129]
This paper introduces GeoAggregator, an efficient and lightweight algorithm for geospatial data modeling. We benchmark it against spatial statistical models, XGBoost, and several state-of-the-art geospatial deep learning methods. Results demonstrate that GeoAggregator achieves the best or second-best performance compared to its competitors on nearly all datasets.
arXiv Detail & Related papers (2025-02-20T20:39:15Z) - GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.