GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing
- URL: http://arxiv.org/abs/2601.17089v1
- Date: Fri, 23 Jan 2026 10:12:59 GMT
- Title: GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing
- Authors: Qigan Sun, Chaoning Zhang, Jianwei Zhang, Xudong Wang, Jiehui Xie, Pengcheng Zheng, Haoyu Wang, Sungyoung Lee, Chi-lok Andy Tai, Yang Yang, Heng Tao Shen,
- Abstract summary: We propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP)<n>GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid.<n>Experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods.
- Score: 50.961694646995376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, Multimodal Large Language Models (MLLMs) have made significant progress in visual question answering tasks. However, directly applying existing fine-tuning methods to remote sensing (RS) images often leads to issues such as overfitting on background noise or neglecting target details. This is primarily due to the large-scale variations, sparse target distributions, and complex regional semantic features inherent in RS images. These challenges limit the effectiveness of MLLMs in RS tasks. To address these challenges, we propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Through a question-guided sparse fusion mechanism, GRASP dynamically aggregates task-specific context into a compact global prompt, enabling the model to focus on relevant regions while filtering out background noise. Extensive experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods while maintaining high parameter efficiency.
Related papers
- RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation [71.2136732268131]
RGB-Thermal (RGBT) tracking aims to achieve robust object localization across diverse environmental conditions.<n>Existing RGBT trackers rely solely on initial-frame visual information for target modeling.<n>We propose RAGTrack, a novel Retrieval-Augmented Generation framework for robust RGBT tracking.
arXiv Detail & Related papers (2026-03-04T01:02:04Z) - ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks [49.99788276124186]
Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm.<n>We present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing.<n>We propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance.
arXiv Detail & Related papers (2025-11-15T15:47:46Z) - FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
textscFineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects.<n>We present textscFineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z) - Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images [51.74614065919118]
This paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images.<n>We propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features.<n>We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features.
arXiv Detail & Related papers (2025-08-25T14:22:57Z) - GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing [5.653111274028541]
We propose GeoMag, an end-to-end general-purpose large model framework for remote sensing.<n>GeoMag focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing.<n>This approach improves the model's perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery.
arXiv Detail & Related papers (2025-07-08T11:21:03Z) - EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning.<n>Our method enables zero-shot spatial reasoning without task-specific fine-tuning.<n>Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z) - FANet: Feature Amplification Network for Semantic Segmentation in Cluttered Background [9.970265640589966]
Existing deep learning approaches leave out the semantic cues that are crucial in semantic segmentation present in complex scenarios.
We propose a feature amplification network (FANet) as a backbone network that incorporates semantic information using a novel feature enhancement module at multi-stages.
Our experimental results demonstrate the state-of-the-art performance compared to existing methods.
arXiv Detail & Related papers (2024-07-12T15:57:52Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Learning to Aggregate Multi-Scale Context for Instance Segmentation in
Remote Sensing Images [28.560068780733342]
A novel context aggregation network (CATNet) is proposed to improve the feature extraction process.
The proposed model exploits three lightweight plug-and-play modules, namely dense feature pyramid network (DenseFPN), spatial context pyramid ( SCP), and hierarchical region of interest extractor (HRoIE)
arXiv Detail & Related papers (2021-11-22T08:55:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.