RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning
- URL: http://arxiv.org/abs/2601.21634v1
- Date: Thu, 29 Jan 2026 12:35:57 GMT
- Title: RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning
- Authors: Shiqi Huang, Shuting He, Bihan Wen,
- Abstract summary: Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
- Score: 61.84363374647606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
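The abstract does not give the exact formula for the positional reward, only that it is continuous and distance-aware, unlike a hard IoU-threshold reward. As a minimal illustrative sketch (assuming axis-aligned boxes and an exponential decay; the decay shape and `scale` parameter are my assumptions, not the paper's):

```python
import math

def positional_reward(pred_box, gt_box, scale=100.0):
    """Illustrative continuous, distance-aware localization reward.

    Boxes are (x1, y1, x2, y2) in pixels. Unlike a binary IoU-threshold
    reward, this decays smoothly with the distance between box centers,
    so rollouts that land near the target still receive a useful
    gradient signal even when the boxes do not overlap.
    """
    # Centers of the predicted and ground-truth boxes.
    px = (pred_box[0] + pred_box[2]) / 2.0
    py = (pred_box[1] + pred_box[3]) / 2.0
    gx = (gt_box[0] + gt_box[2]) / 2.0
    gy = (gt_box[1] + gt_box[3]) / 2.0
    # Euclidean center distance, mapped to (0, 1] by an exponential decay.
    dist = math.hypot(px - gx, py - gy)
    return math.exp(-dist / scale)
```

A reward of this shape would be added to the usual RFT objective alongside format and accuracy rewards; a prediction centered on the target scores 1.0, and the score falls off gradually rather than dropping to zero once IoU crosses a threshold.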
Related papers
- VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction [0.0]
VLMFusionOcc3D is a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. We introduce Weather-Aware Adaptive Fusion, a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions. Our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
arXiv Detail & Related papers (2026-03-03T05:22:28Z)
- Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing [9.357861053928898]
Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse. We propose Uni-RS, the first unified model tailored for remote sensing. We show that our approach substantially improves spatial faithfulness in text-to-image generation.
arXiv Detail & Related papers (2026-01-25T03:22:26Z)
- SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing [57.609801041296095]
Vision-language models (VLMs) are emerging as powerful tools for remote sensing. We enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism.
arXiv Detail & Related papers (2025-12-09T18:15:43Z)
- SVRecon: Sparse Voxel Rasterization for Surface Reconstruction [60.92372415355283]
We extend the recently proposed sparse voxelization paradigm to the task of high-fidelity surface reconstruction with SVRecon. Our method achieves strong reconstruction accuracy with consistently fast convergence.
arXiv Detail & Related papers (2025-11-21T16:32:01Z)
- Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images [51.74614065919118]
This paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images. We propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features.
arXiv Detail & Related papers (2025-08-25T14:22:57Z)
- DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models [9.109484087832058]
DiffRIS is a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) and a cross-modal reasoning decoder (PCMRD).
arXiv Detail & Related papers (2025-06-23T02:38:56Z)
- SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization [57.484274282231226]
We propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects. Our model, SVQA-R1, not only dramatically improves accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths even without using supervised fine-tuning data.
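The blurb does not specify how the view-consistent reward is computed; the following is a hypothetical sketch of the underlying idea only: when the scene is mirrored left-to-right, a spatial answer should flip predictably, and answers that do so can be rewarded. The `FLIP` mapping and function name are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of a view-consistency check: under a left-right
# mirror of the scene, "left" and "right" should swap while depth
# relations ("front", "behind") stay fixed. An answer pair that
# respects this symmetry earns the reward.
FLIP = {"left": "right", "right": "left", "front": "front", "behind": "behind"}

def view_consistency_reward(answer_original, answer_mirrored):
    """Return 1.0 if the two answers are consistent under mirroring."""
    expected = FLIP.get(answer_original)
    return 1.0 if expected is not None and answer_mirrored == expected else 0.0
```

Such a reward needs no ground-truth labels for the perturbed view, which is what makes a group-wise, self-consistency-based RL signal possible.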
arXiv Detail & Related papers (2025-06-02T06:58:43Z)
- RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation [22.877384781595556]
We introduce Range-Aware Pointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg architecture.
RAPiD features exhibit rigid transformation invariance and effectively adapt to variations in point density.
We propose a double-nested autoencoder structure with a novel class-aware embedding objective to encode high-dimensional features into manageable voxel-wise embeddings.
arXiv Detail & Related papers (2024-07-14T10:59:34Z)
- SIRI: Spatial Relation Induced Network For Spatial Description Resolution [64.38872296406211]
We propose a novel spatial relation induced (SIRI) network for language-guided localization.
We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius.
Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.