Related papers: GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing

URL: http://arxiv.org/abs/2501.13925v1
Date: Thu, 23 Jan 2025 18:59:30 GMT
Title: GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing
Authors: Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan,
Abstract summary: GeoPixel is an end-to-end high resolution RS-LMM that supports pixel-level grounding.<n>It supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis.<n>GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks.
Score: 32.85223015863783
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge in region-level comprehension. Moreover, the development of the grounding conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel - the first end-to-end high resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. To support the grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset GeoPixelD through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.

Related papers

GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing [50.961694646995376]
We propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP)<n>GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid.<n>Experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods.
arXiv Detail & Related papers (2026-01-23T10:12:59Z)
ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks [49.99788276124186]
Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm.<n>We present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing.<n>We propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance.
arXiv Detail & Related papers (2025-11-15T15:47:46Z)
FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
textscFineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects.<n>We present textscFineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z)
GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing [5.653111274028541]
We propose GeoMag, an end-to-end general-purpose large model framework for remote sensing.<n>GeoMag focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing.<n>This approach improves the model's perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery.
arXiv Detail & Related papers (2025-07-08T11:21:03Z)
CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image [86.75098349480014]
This paper tackles category-level pose estimation of articulated objects in robotic manipulation tasks. We propose a single-stage Network, CAP-Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. We introduce the RGBD-Art dataset, the largest RGB-D articulated dataset to date, featuring RGB images and depth noise simulated from real sensors.
arXiv Detail & Related papers (2025-04-15T14:30:26Z)
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning [31.696397337675847]
Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images. We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method outperforms existing high-resolution strategies on four datasets using the same data.
arXiv Detail & Related papers (2025-03-10T17:51:16Z)
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing [22.729750410621826]
GeoPix is a RS MLLM that extends image understanding capabilities to the pixel level.<n>To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor.<n>To address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset.
arXiv Detail & Related papers (2025-01-12T14:45:27Z)
MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing [12.491684385808902]
MMO-IG is designed to generate RS images with supervised object labels from global and local aspects simultaneously.<n>Considering the complex interdependencies among MMOs, we construct a spatial-cross dependency knowledge graph.<n>Our MMO-IG exhibits superior generation capabilities for RS images with dense MMO-supervised labels.
arXiv Detail & Related papers (2024-12-18T10:19:12Z)
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularity.<n> RSUniVLM performs effectively in multi-image analysis, with instances of change detection and change captioning.<n>We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both RS and general domain.
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
PGNeXt: High-Resolution Salient Object Detection via Pyramid Grafting Network [24.54269823691119]
We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives. To compensate for the lack of HRSOD dataset, we thoughtfully collect a large-scale high resolution salient object detection dataset, called UHRSD. All the images are finely annotated in pixel-level, far exceeding previous low-resolution SOD datasets.
arXiv Detail & Related papers (2024-08-02T09:31:21Z)
Scaling Efficient Masked Image Modeling on Large Remote Sensing Dataset [66.15872913664407]
We present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient MIM approach.<n>We curated a high-quality dataset named OpticalRS-13M by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication.<n>Experiments demonstrate that OpticalRS-13M significantly improves classification, detection, and segmentation performance, while SelectiveMAE increases training efficiency over 2 times.
arXiv Detail & Related papers (2024-06-17T15:41:57Z)
RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images. RSM is specifically designed to capture the global context of remote sensing images with linear complexity. Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Models (VLMs) that offers multitask conversational capabilities with high-resolution RS images. Specifically, GeoChat can answer image-level queries but also accepts region inputs to hold region-specific dialogue. GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z)
High-resolution Depth Maps Imaging via Attention-based Hierarchical Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided DSR. We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z)
Deep Burst Super-Resolution [165.90445859851448]
We propose a novel architecture for the burst super-resolution task. Our network takes multiple noisy RAW images as input, and generates a denoised, super-resolved RGB image as output. In order to enable training and evaluation on real-world data, we additionally introduce the BurstSR dataset.
arXiv Detail & Related papers (2021-01-26T18:57:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.