Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding
- URL: http://arxiv.org/abs/2602.14225v1
- Date: Sun, 15 Feb 2026 16:40:33 GMT
- Title: Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding
- Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yuhao Zhou, Di Wang, Yifan Zhang, Haoyu Wang, Haiyan Zhao, Hongda Sun, Long Lan, Jun Song, Yulin Wang, Jing Zhang, Wenlong Zhang, Bo Du,
- Abstract summary: Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition.<n>We find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors.<n>We propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL.
- Score: 78.26501371437013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model necessitates localizing tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms: comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on the UHR RS benchmark.Our controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concepts, mechanistic explanations, and decision rules necessary to guide visual evidence retrieval.Based on this, we propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures;and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL. This approach achieves a 60.40% Pass@1 on XLRS-Bench, significantly outperforming larger general purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) and establishing a new state-of-the-art.
Related papers
- FUSAR-GPT : A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery [8.62554606349568]
FUSAR-GPT is a VLM specifically for Synthetic Aperture Radar (SAR) applications.<n>It embeds multi-source remote-sensing temporal features into the model's visual backbone via'spatiotemporal anchors'<n>It achieves state-of-the-art performance across several typical remote sensing visual-language benchmark tests.
arXiv Detail & Related papers (2026-02-22T13:40:17Z) - GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
"thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools.<n>This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny.<n>We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom
arXiv Detail & Related papers (2026-02-15T15:50:55Z) - Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage [4.771792258699647]
We propose textbfHead Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads.<n> experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies.
arXiv Detail & Related papers (2026-01-30T02:46:55Z) - ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline.<n>We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z) - The Path Not Taken: RLVR Provably Learns Off the Principals [85.41043469428365]
We show that sparsity is a surface artifact of a model-conditioned optimization bias.<n>We mechanistically explain these dynamics with a Three-Gate Theory.<n>We provide a parameter-level characterization of RLVR's learning dynamics.
arXiv Detail & Related papers (2025-11-11T18:49:45Z) - Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning [93.19037653970622]
We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images.<n>Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities.<n>Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
arXiv Detail & Related papers (2025-10-31T16:30:08Z) - An Empirical Study of Remote Sensing Pretraining [117.90699699469639]
We conduct an empirical study of remote sensing pretraining (RSP) on aerial images.
RSP can help deliver distinctive performances in scene recognition tasks.
RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, but it may still suffer from task discrepancies.
arXiv Detail & Related papers (2022-04-06T13:38:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.