GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing
- URL: http://arxiv.org/abs/2507.05887v2
- Date: Fri, 18 Jul 2025 12:41:26 GMT
- Title: GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing
- Authors: Xianzhi Ma, Jianhui Li, Changhua Pei, Hao Liu
- Abstract summary: We propose GeoMag, an end-to-end general-purpose large model framework for remote sensing. GeoMag focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing. This approach improves the model's perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery.
- Score: 5.653111274028541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC), which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. This approach improves the model's perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.
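To make the abstract's mechanism concrete, here is a minimal sketch of the general idea behind prompt-guided cropping with multi-granularity resolution adjustment. This is not the authors' implementation: the function name, the precomputed relevance map, the threshold, and the downsampling factor are all hypothetical.

```python
import numpy as np

def parse_with_focus(image, relevance, thresh=0.5, coarse_factor=4):
    """Hypothetical sketch of the idea behind TMRA/PSC: keep the
    prompt-relevant region at full resolution and downsample the rest."""
    # Low-resolution global context (task-driven resolution adjustment).
    context = image[::coarse_factor, ::coarse_factor]
    # Bounding box of the prompt-relevant pixels (semantic-aware cropping).
    ys, xs = np.where(relevance >= thresh)
    if len(ys) == 0:
        return context, context  # nothing relevant: coarse view only
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return crop, context
```

The model would then attend over the full-resolution crop plus the coarse context, rather than the entire high-resolution image.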
Related papers
- GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery [69.05066425853326]
The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. We propose GeoEyes, a training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zooming.
arXiv Detail & Related papers (2026-02-15T15:50:55Z) - GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing [50.961694646995376]
We propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods.
arXiv Detail & Related papers (2026-01-23T10:12:59Z) - ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks [49.99788276124186]
Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm. We present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing. We propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance.
arXiv Detail & Related papers (2025-11-15T15:47:46Z) - FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z) - GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet). It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z) - SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model [23.383837540690823]
High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. Due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition.
arXiv Detail & Related papers (2025-05-29T02:38:34Z) - GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing [33.19843463374473]
Vision-Language Models (VLMs) in remote sensing have demonstrated significant potential in traditional tasks. Current models, which excel in Referring Expression (REC), struggle with tasks involving complex instructions. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT). We propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition and a self-augmentation strategy based on cyclic referring.
arXiv Detail & Related papers (2025-03-16T12:48:17Z) - When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning [31.696397337675847]
Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images. We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method outperforms existing high-resolution strategies on four datasets using the same data.
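As a rough illustration of text-guided token pruning in general (not this paper's specific method — the scoring rule, keep ratio, and embedding shapes are assumptions):

```python
import numpy as np

def prune_tokens(visual_tokens, text_embedding, keep_ratio=0.25):
    """Hypothetical sketch: keep the visual tokens most similar to the
    text query and drop the rest, shrinking the LVLM's visual input."""
    # Cosine similarity between each visual token and the text embedding.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    scores = v @ t
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.argsort(scores)[-k:]          # indices of the top-k tokens
    return visual_tokens[np.sort(keep)]     # preserve original token order
```

A real system would apply this per pyramid level and feed only the surviving tokens to the language model.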
arXiv Detail & Related papers (2025-03-10T17:51:16Z) - GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing [32.85223015863783]
GeoPixel is an end-to-end high-resolution RS-LMM that supports pixel-level grounding. It supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks.
arXiv Detail & Related papers (2025-01-23T18:59:30Z) - Efficient Visual State Space Model for Image Deblurring [99.54894198086852]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. We propose a simple yet effective visual state space model (EVSSM) for image deblurring. The proposed EVSSM performs favorably against state-of-the-art methods on benchmark datasets and real-world images.
arXiv Detail & Related papers (2024-05-23T09:13:36Z) - RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Memory-augmented Deep Unfolding Network for Guided Image Super-resolution [67.83489239124557]
Guided image super-resolution (GISR) aims to obtain a high-resolution (HR) target image by enhancing the spatial resolution of a low-resolution (LR) target image under the guidance of a HR image.
Previous model-based methods mainly take the entire image as a whole and assume a prior distribution between the HR target image and the HR guidance image.
We propose a maximum a posteriori (MAP) estimation model for GISR with two types of priors on the HR target image.
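For orientation, the generic MAP formulation behind such guided super-resolution approaches can be written as follows (generic symbols, not necessarily the paper's exact model):

```latex
\hat{X} = \arg\max_{X} \; p(Y \mid X)\, p(X \mid G)
        \;\Longleftrightarrow\;
\hat{X} = \arg\min_{X} \; \|Y - DX\|_2^2 + \lambda\, \Phi(X, G)
```

Here $Y$ is the LR target, $G$ the HR guidance image, $D$ a downsampling/degradation operator, and $\Phi$ a guidance-dependent prior with weight $\lambda$; the equivalence follows from taking negative logs under a Gaussian noise model.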
arXiv Detail & Related papers (2022-02-12T15:37:13Z) - High-resolution Depth Maps Imaging via Attention-based Hierarchical Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided DSR.
We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z) - Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration tasks.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.