Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search
- URL: http://arxiv.org/abs/2511.20460v2
- Date: Fri, 28 Nov 2025 13:03:57 GMT
- Title: Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search
- Authors: Yunqi Zhou, Chengjie Jiang, Chun Yuan, Jing Li
- Abstract summary: ZoomSearch is a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS.
- Score: 44.758226499411904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on the Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20-44% in speed.
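The abstract names the two stages but not their mechanics. Below is a minimal, hypothetical sketch of what a query-guided hierarchical zoom search can look like: recursively split the image, score each sub-region against the question with a cheap image-text scorer, and descend only into the best branches. The quadtree branching, thresholds, and `score_fn` are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical query-guided hierarchical zoom search; the paper's
# Adaptive Multi-Branch Zoom Search is more sophisticated than this.
from PIL import Image  # image is assumed to be a PIL Image

def split_quadrants(box):
    """Split an (x0, y0, x1, y1) box into four sub-boxes."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]

def zoom_search(image, question, score_fn, top_k=2, min_side=512, max_depth=3):
    """Descend into the most query-relevant regions of an Ultra-HR image.

    score_fn(crop, question) -> float can be any cheap image-text
    relevance scorer (e.g., a CLIP similarity). Returns (box, score) pairs.
    """
    frontier = [((0, 0, image.width, image.height), 0.0)]
    leaves = []
    for _ in range(max_depth):
        candidates = []
        for box, _ in frontier:
            for child in split_quadrants(box):
                candidates.append((child, score_fn(image.crop(child), question)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        frontier = []
        for box, score in candidates[:top_k]:
            if min(box[2] - box[0], box[3] - box[1]) <= min_side:
                leaves.append((box, score))    # small enough to encode directly
            else:
                frontier.append((box, score))  # still too large: zoom further
        if not frontier:
            break
    return leaves + frontier
```

In ZoomSearch, the selected regions would then be cropped and tiled onto a compact canvas (the Layout-Aware Patch Reassembly step) before a single pass through the VLM, which is presumably what keeps the token budget flat regardless of input resolution.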
Related papers
- Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding [78.26501371437013]
Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition. We find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. We propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL.
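Read as a training schedule, the recipe reduces to an ordering over three phases; the sketch below encodes just that ordering. The names (`sft_step`, `rl_step`, `text_qa`, `hard_uhr_pairs`) are placeholders, not the paper's API.

```python
from typing import Any, Callable, Iterable

def staged_knowledge_injection(
    sft_step: Callable[[Any], None],  # one supervised update (placeholder)
    rl_step: Callable[[Any], None],   # one tool-based RL update (placeholder)
    text_qa: Iterable[Any],           # knowledge-graph-verified text QA
    hard_uhr_pairs: Iterable[Any],    # hard UHR image-text examples
) -> None:
    # Stage 1: cold-start SFT on Earth-science *text* QA, instilling
    # reasoning structure before any visual grounding is attempted.
    for batch in text_qa:
        sft_step(batch)
    # Stage 2: "pre-warm" during SFT on the same hard UHR examples the
    # RL phase will see, to stabilize the subsequent RL.
    for batch in hard_uhr_pairs:
        sft_step(batch)
    # Stage 3: agentic, tool-based RL on the UHR tasks themselves.
    for batch in hard_uhr_pairs:
        rl_step(batch)
```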
arXiv Detail & Related papers (2026-02-15T16:40:33Z)
- GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
"thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools.<n>This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny.<n>We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom
arXiv Detail & Related papers (2026-02-15T15:50:55Z)
- ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks [49.99788276124186]
Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm. We present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing. We propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance.
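One natural reading of a "Region-Guided reward" is a dense geometric signal such as the IoU between the crop an agent chooses and an annotated ground-truth region. The sketch below implements that reading, which may differ from ZoomEarth's published formulation.

```python
# IoU-based region reward: a guess at what "fine-grained guidance" from
# annotated regions could mean, not ZoomEarth's actual reward.
def box_area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def box_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def region_guided_reward(predicted_crop, gt_region):
    # Dense signal in [0, 1]: partial overlap earns partial credit.
    return box_iou(predicted_crop, gt_region)
```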
arXiv Detail & Related papers (2025-11-15T15:47:46Z)
- A Resource-Efficient Training Framework for Remote Sensing Text–Image Retrieval [5.831764081074079]
We propose a computation- and memory-efficient retrieval framework for RSTIR. To reduce training memory consumption, we propose the Focus-Adapter module. Our method reduces memory consumption by 49% and achieves 1.4x the data throughput during training.
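The abstract only names the Focus-Adapter. For reference, a generic bottleneck adapter (frozen backbone, small trainable residual branch) is the standard way such modules cut training memory; the sketch below shows that generic pattern, not the paper's module.

```python
# Generic bottleneck adapter in PyTorch: only the small down/up projections
# are trained, so optimizer state and gradients for the backbone vanish.
# A common pattern, not the paper's Focus-Adapter design.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Typical usage: freeze the backbone, then train adapters only.
# for p in backbone.parameters():
#     p.requires_grad = False
```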
arXiv Detail & Related papers (2025-01-18T02:51:43Z)
- Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We make the first attempt to integrate the Vision State Space Model (Mamba) into remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, we build upon Mamba to devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM).
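The blurb names three modules and a fusion; a structural skeleton of that wiring is sketched below, with simplified stand-in internals (the paper's VSSM is a Mamba state-space block, approximated here by a residual convolution). Only the FSM/VSSM/HGM fusion pattern follows the abstract.

```python
# Structural skeleton of FMSR's multi-level fusion with stand-in internals.
import torch
import torch.nn as nn

class FSM(nn.Module):
    """Frequency Selection stand-in: per-channel gating in Fourier space."""
    def __init__(self, ch: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(ch, 1, 1))

    def forward(self, x):
        f = torch.fft.rfft2(x, norm="ortho")
        return torch.fft.irfft2(f * self.gate, s=x.shape[-2:], norm="ortho")

class VSSM(nn.Module):
    """Vision state-space stand-in (a residual conv here; Mamba in the paper)."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.body(x)

class HGM(nn.Module):
    """Hybrid Gate stand-in: learned gate blending the two branches."""
    def __init__(self, ch: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())

    def forward(self, a, b):
        g = self.gate(torch.cat([a, b], dim=1))
        return g * a + (1 - g) * b

class FMSRBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.fsm, self.vssm, self.hgm = FSM(ch), VSSM(ch), HGM(ch)

    def forward(self, x):
        # Fuse the frequency branch with the (stand-in) state-space branch.
        return self.hgm(self.fsm(x), self.vssm(x))
```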
arXiv Detail & Related papers (2024-05-08T11:09:24Z)
- RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
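The linear-complexity claim is the key point: a state-space scan mixes information across all previous tokens in a single O(L) pass, versus attention's O(L^2) pairwise cost. A toy (non-Mamba) recurrence makes this concrete:

```python
# Toy diagonal linear recurrence over flattened image tokens: each output
# carries a running summary of everything before it, in one O(L) pass.
# This illustrates why SSMs scale to large images; it is not RSM itself.
import torch

def linear_scan(tokens: torch.Tensor, decay: float = 0.9) -> torch.Tensor:
    """tokens: (L, D) sequence from a flattened image. Returns (L, D)."""
    state = torch.zeros(tokens.shape[1])
    outputs = []
    for tok in tokens:  # L steps, constant work per step
        state = decay * state + (1 - decay) * tok
        outputs.append(state)
    return torch.stack(outputs)
```

For a 10,000x10,000-pixel image with 16-pixel patches, L is roughly 390k tokens; quadratic attention is infeasible at that length, while a linear scan is not.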
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- A Novel Multi-scale Attention Feature Extraction Block for Aerial Remote Sensing Image Classification [9.388978548253755]
We propose a novel plug-and-play multi-scale attention feature extraction block (MSAFEB) based on multi-scale convolution at two levels with a skip connection.
An experimental study on two benchmark VHR aerial RS image datasets (AID and NWPU) demonstrates that our proposal achieves stable, consistent performance (minimum standard deviation of 0.002) and competent overall classification performance (AID: 95.85%, NWPU: 94.09%).
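The description (multi-scale convolution at two levels, plus a skip connection) maps naturally onto a small PyTorch block; the kernel sizes and the channel-attention choice below are guesses rather than the paper's exact configuration.

```python
# Sketch of a two-level multi-scale attention block matching the blurb's
# description; details are assumptions, not the paper's specification.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1x1/3x3/5x5 convolutions fused back to `ch` channels."""
    def __init__(self, ch: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2) for k in (1, 3, 5)
        )
        self.fuse = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class MSAFEB(nn.Module):
    """Two multi-scale levels, channel attention, and a skip connection."""
    def __init__(self, ch: int):
        super().__init__()
        self.level1 = MultiScaleConv(ch)
        self.level2 = MultiScaleConv(ch)
        self.attn = nn.Sequential(  # SE-style channel attention (assumed)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid()
        )

    def forward(self, x):
        y = self.level2(self.level1(x))
        return x + y * self.attn(y)  # skip connection around the block
```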
arXiv Detail & Related papers (2023-08-27T11:49:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.