SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
- URL: http://arxiv.org/abs/2406.10100v2
- Date: Mon, 8 Jul 2024 04:33:37 GMT
- Title: SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
- Authors: Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, Yansheng Li,
- Abstract summary: We propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples.
FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty.
We establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC.
- Score: 26.08043905865113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote Sensing Large Multi-Modal Models (RSLMMs) are developing rapidly and showcase significant capabilities in remote sensing imagery (RSI) comprehension. However, due to the limitations of existing datasets, RSLMMs have shortcomings in understanding the rich semantic relations among objects in complex remote sensing scenes. To unlock RSLMMs' complex comprehension ability, we propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples. FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty, ranging from relation reasoning to image-level scene graph generation. Based on FIT-RS, we build the FIT-RSFG benchmark. Furthermore, we establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC. Based on combined instruction data, we propose SkySenseGPT, which achieves outstanding performance on both public datasets and FIT-RSFG, surpassing existing RSLMMs. We hope the FIT-RS dataset can enhance the relation comprehension capability of RSLMMs and provide a large-scale fine-grained data source for the remote sensing community. The dataset will be available at https://github.com/Luo-Z13/SkySenseGPT
Related papers
- Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries [0.12744523252873352]
RUNE (Reasoning Using Neurosymbolic Entities) is an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images.<n>Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability.<n>We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
arXiv Detail & Related papers (2025-12-16T05:33:44Z) - Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning.<n>We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL.<n>We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing [16.401094981355218]
Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping.<n>We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms.<n>We present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries.
arXiv Detail & Related papers (2025-11-21T17:41:26Z) - MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval [50.30107119622642]
Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data.<n>Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge.<n>MarAG-R1 is a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms.
arXiv Detail & Related papers (2025-10-31T15:51:39Z) - PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization [82.96200364977737]
We introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks.<n>PlotCraft is structured around seven high-level visualization tasks and encompasses 48 distinct chart types.<n>It is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities.
arXiv Detail & Related papers (2025-10-15T10:14:39Z) - GRIL: Knowledge Graph Retrieval-Integrated Learning with Large Language Models [59.72897499248909]
We propose a novel graph retriever trained end-to-end with Large Language Models (LLMs)<n>Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, which are infused into the LLM together.<n>Our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks.
arXiv Detail & Related papers (2025-09-20T02:38:00Z) - Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing [55.291219073365546]
Open-Vocabulary Remote Sensing Image (OVRSIS) is an emerging task that adapts Open-Vocabulary (OVS) to the remote sensing (RS) domain.<n>textbfRSKT-Seg is a novel open-vocabulary segmentation framework tailored for remote sensing.<n> RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation.
arXiv Detail & Related papers (2025-09-15T15:24:49Z) - Region-Level Context-Aware Multimodal Understanding [52.565528550835786]
Region-level Context-aware Multimodal Understanding (RCMU) is an ability to integrate textual context associated with objects for a more context-aware multimodal understanding.<n>To equip MLLMs with RCMU capabilities, we propose Region-level Context-aware Visual Instruction Tuning (RCVIT)<n>We introduce the RCMU dataset, a large-scale visual instruction tuning dataset that covers multiple RCMU tasks.<n>We also propose RC&P-Bench, a comprehensive benchmark that can evaluate the performance of MLLMs in RCMU and multimodal personalized understanding tasks.
arXiv Detail & Related papers (2025-08-17T07:18:43Z) - MambaOutRS: A Hybrid CNN-Fourier Architecture for Remote Sensing Image Classification [4.14360329494344]
We introduce MambaOutRS, a novel hybrid convolutional architecture for remote sensing image classification.<n>MambaOutRS builds upon stacked Gated CNN blocks for local feature extraction and introduces a novel Fourier Filter Gate (FFG) module.
arXiv Detail & Related papers (2025-06-24T12:20:11Z) - PolSAM: Polarimetric Scattering Mechanism Informed Segment Anything Model [76.95536611263356]
PolSAR data presents unique challenges due to its rich and complex characteristics.
Existing data representations, such as complex-valued data, polarimetric features, and amplitude images, are widely used.
Most feature extraction networks for PolSAR are small, limiting their ability to capture features effectively.
We propose the Polarimetric Scattering Mechanism-Informed SAM (PolSAM), an enhanced Segment Anything Model (SAM) that integrates domain-specific scattering characteristics and a novel prompt generation strategy.
arXiv Detail & Related papers (2024-12-17T09:59:53Z) - SpecSAR-Former: A Lightweight Transformer-based Network for Global LULC Mapping Using Integrated Sentinel-1 and Sentinel-2 [13.17346252861919]
We introduce the Dynamic World+ dataset, expanding the current authoritative multispectral dataset, Dynamic World.
To facilitate the combination of multispectral and SAR data, we propose a lightweight transformer architecture termed SpecSAR-Former.
Our network outperforms existing transformer and CNN-based models, achieving a mean Intersection over Union (mIoU) of 59.58%, an Overall Accuracy (OA) of 79.48%, and an F1 Score of 71.68% with only 26.70M parameters.
arXiv Detail & Related papers (2024-10-04T22:53:25Z) - A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z) - RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding [4.266920365127677]
Under the new LaGD paradigm, the old datasets are no longer suitable for fire-new tasks.
We designed a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding.
The empirical results show that the fine-tuned MLLMs by RS-GPT4V can describe fine-grained information.
arXiv Detail & Related papers (2024-06-18T10:34:28Z) - Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We develop the first attempt to integrate the Vision State Space Model (Mamba) for remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, building upon Mamba, we devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with the Frequency Selection Module (FSM), Vision State Space Module (VSSM), and Hybrid Gate Module (HGM)
arXiv Detail & Related papers (2024-05-08T11:09:24Z) - Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases [9.478012553728538]
We propose an end-to-end system design towards utilizing Retrieval Augmented Generation (RAG) to improve the factual accuracy of Large Language Models (LLMs)
Our system integrates RAG pipeline with upstream datasets processing and downstream performance evaluation.
Our experiments demonstrate the system's effectiveness in generating more accurate answers to domain-specific and time-sensitive inquiries.
arXiv Detail & Related papers (2024-03-15T16:30:14Z) - SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection [79.23689506129733]
We establish a new benchmark dataset and an open-source method for large-scale SAR object detection.
Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets.
To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created.
arXiv Detail & Related papers (2024-03-11T09:20:40Z) - SwiMDiff: Scene-wide Matching Contrastive Learning with Diffusion
Constraint for Remote Sensing Image [21.596874679058327]
SwiMDiff is a novel self-supervised pre-training framework for remote sensing images.
It recalibrates labels to recognize data from the same scene as false negatives.
It seamlessly integrates contrastive learning (CL) with a diffusion model.
arXiv Detail & Related papers (2024-01-10T11:55:58Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - GAMUS: A Geometry-aware Multi-modal Semantic Segmentation Benchmark for
Remote Sensing Data [27.63411386396492]
This paper introduces a new benchmark dataset for multi-modal semantic segmentation based on RGB-Height (RGB-H) data.
The proposed benchmark consists of 1) a large-scale dataset including co-registered RGB and nDSM pairs and pixel-wise semantic labels; 2) a comprehensive evaluation and analysis of existing multi-modal fusion strategies for both convolutional and Transformer-based networks on remote sensing data.
arXiv Detail & Related papers (2023-05-24T09:03:18Z) - Synergistic Interplay between Search and Large Language Models for
Information Retrieval [141.18083677333848]
InteR allows RMs to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z) - Learning Detail-Structure Alternative Optimization for Blind
Super-Resolution [69.11604249813304]
We propose an effective and kernel-free network, namely DSSR, which enables recurrent detail-structure alternative optimization without blur kernel prior incorporation for blind SR.
In our DSSR, a detail-structure modulation module (DSMM) is built to exploit the interaction and collaboration of image details and structures.
Our method achieves the state-of-the-art against existing methods.
arXiv Detail & Related papers (2022-12-03T14:44:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.