RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model
- URL: http://arxiv.org/abs/2504.04988v1
- Date: Mon, 07 Apr 2025 12:13:43 GMT
- Title: RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model
- Authors: Congcong Wen, Yiting Lin, Xiaokang Qu, Nan Li, Yong Liao, Hui Lin, Xiang Li
- Abstract summary: We propose a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The RS-RAG framework retrieves relevant knowledge based on image and/or text queries and incorporates the retrieved content into a knowledge-augmented prompt. We validate the effectiveness of our approach on three representative vision-language tasks, including image captioning, image classification, and visual question answering, where RS-RAG significantly outperforms state-of-the-art baselines.
- Score: 16.343935641777268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in vision-language models (VLMs) has demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advancements, the remote sensing community has begun to adopt VLMs for vision-language tasks such as scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding and focus on generic scene descriptions, lacking the ability to incorporate external knowledge. This limitation hinders their capacity for semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduce a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building upon this dataset, we propose a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and associated textual knowledge into a unified vector space. The Knowledge Retrieval and Response Generation module retrieves and re-ranks relevant knowledge based on image and/or text queries, and incorporates the retrieved content into a knowledge-augmented prompt that guides the VLM toward contextually grounded responses. We validate the effectiveness of our approach on three representative vision-language tasks, namely image captioning, image classification, and visual question answering, where RS-RAG significantly outperforms state-of-the-art baselines.
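This listing contains no reference code, so the following is a minimal, self-contained sketch of the two-module flow the abstract describes, built around a toy in-memory vector store. Every name here (KnowledgeDB, rerank, build_prompt) is hypothetical; the paper's actual encoders, re-ranking model, and VLM interface are not specified in this summary.

```python
# Minimal sketch of the RS-RAG flow described above. All names are
# hypothetical; the paper's actual encoders, re-ranking model, and
# VLM interface are not specified in this listing.
import numpy as np

class KnowledgeDB:
    """Toy in-memory vector store holding unified image/text embeddings."""

    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.docs: list[str] = []

    def add(self, vec: np.ndarray, doc: str) -> None:
        self.vectors.append(vec / np.linalg.norm(vec))
        self.docs.append(doc)

    def search(self, query: np.ndarray, k: int = 5) -> list[str]:
        q = query / np.linalg.norm(query)
        sims = np.stack(self.vectors) @ q          # cosine similarities
        top = np.argsort(-sims)[:k]
        return [self.docs[i] for i in top]

def rerank(query_text: str, candidates: list[str]) -> list[str]:
    # Placeholder lexical re-ranker; RS-RAG's actual re-ranking model
    # is not described in this listing.
    terms = set(query_text.lower().split())
    return sorted(candidates,
                  key=lambda d: -len(terms & set(d.lower().split())))

def build_prompt(question: str, knowledge: list[str]) -> str:
    # Knowledge-augmented prompt: retrieved facts are prepended to the query.
    facts = "\n".join(f"- {k}" for k in knowledge)
    return f"Relevant knowledge:\n{facts}\n\nQuestion: {question}"
```

At query time, an image would be embedded with the same shared encoder used at indexing time, so image and text queries retrieve from one unified vector space as the abstract describes.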
Related papers
- Questions beyond Pixels: Integrating Commonsense Knowledge in Visual Question Generation for Remote Sensing [18.383561647568502]
We propose a Knowledge-aware Remote Sensing Visual Question Generation model (KRSVQG), which takes an image and related knowledge triplets from external knowledge sources as inputs to broaden question content. KRSVQG uses a vision-language pre-training and fine-tuning strategy, enabling adaptation to low-data regimes. Results on two datasets demonstrate that KRSVQG outperforms existing methods.
arXiv Detail & Related papers (2026-02-22T14:59:00Z) - SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing [57.609801041296095]
Vision-language models (VLMs) are emerging as powerful tools for remote sensing. We enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism.
arXiv Detail & Related papers (2025-12-09T18:15:43Z) - DescribeEarth: Describe Anything for Remote Sensing Images [56.04533626223295]
We propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset with detailed descriptions of object attributes, relationships, and contexts. We also present DescribeEarth, a Multi-modal Large Language Model architecture explicitly designed for Geo-DLC.
arXiv Detail & Related papers (2025-09-30T01:53:34Z) - SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model [23.383837540690823]
High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring. Due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation. Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition.
arXiv Detail & Related papers (2025-05-29T02:38:34Z) - SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation [12.32553804641971]
Vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M.
arXiv Detail & Related papers (2025-02-12T07:19:36Z) - Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities. We propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images (a toy prompt-serialization sketch appears after this list).
arXiv Detail & Related papers (2024-12-30T13:16:08Z) - Towards Visual Grounding: A Survey [99.0950608237702]
Visual Grounding, also known as Referring Expression and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. Since 2021, visual grounding has witnessed significant advancements, with emerging new concepts such as grounded pre-training. This paper represents the most comprehensive overview currently available in the field of visual grounding.
arXiv Detail & Related papers (2024-12-28T16:34:35Z) - From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing [16.755590790629153]
This review examines the development and application of multi-modal language models (MLLMs) in remote sensing.
We focus on their ability to interpret and describe satellite imagery using natural language.
Key applications such as scene description, object detection, change detection, text-to-image retrieval, image-to-text generation, and visual question answering are discussed.
arXiv Detail & Related papers (2024-11-05T12:14:22Z) - Locate Anything on Earth: Advancing Open-Vocabulary Object Detection for Remote Sensing Community [58.417475846791234]
We propose and train the novel LAE-DINO model, the first open-vocabulary foundation object detector for the LAE task. We conduct experiments on the established remote sensing benchmarks DIOR and DOTAv2.0, as well as our newly introduced 80-class LAE-80C benchmark. Results demonstrate the advantages of the LAE-1M dataset and the effectiveness of the LAE-DINO method.
arXiv Detail & Related papers (2024-08-17T06:24:43Z) - EarthMarker: A Visual Prompting Multi-modal Large Language Model for Remote Sensing [12.9701635989222]
It is difficult to deliver information in complicated remote sensing (RS) scenarios using plain language instructions alone.
EarthMarker is capable of interpreting RS imagery at the image, region, and point levels by leveraging visual prompts.
arXiv Detail & Related papers (2024-07-18T15:35:00Z) - Augmented Commonsense Knowledge for Remote Object Grounding [67.30864498454805]
We propose an augmented commonsense knowledge model (ACK) to leverage commonsense information as a temporal knowledge graph for improving agent navigation.
ACK consists of knowledge graph-aware cross-modal and concept aggregation modules to enhance visual representation and visual-textual data alignment.
We add a new pipeline for the commonsense-based decision-making process, which leads to more accurate local action prediction.
arXiv Detail & Related papers (2024-06-03T12:12:33Z) - VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification.
In the first stage, we prompt VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions serve as contextual information and, together with the image input, are used to train a transformer-based architecture (see the schematic sketch after this list).
arXiv Detail & Related papers (2024-04-10T15:09:15Z) - Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment [61.769441954135246]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language (a toy contrastive-alignment sketch appears after this list).
arXiv Detail & Related papers (2023-12-12T03:39:07Z) - GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat not only answers image-level queries but also accepts region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z) - Dual Semantic Knowledge Composed Multimodal Dialog Systems [114.52730430047589]
We propose a novel multimodal task-oriented dialog system named MDS-S2.
It acquires context-related attribute and relation knowledge from the knowledge base.
We also devise a set of latent query variables to distill the semantic information from the composed response representation.
arXiv Detail & Related papers (2023-05-17T06:33:26Z) - Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model [63.461030694700014]
We propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD).
The proposed DKMD consists of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation.
Experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.
arXiv Detail & Related papers (2022-07-16T13:02:54Z) - External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We generate results comparable to the state-of-the-art on two publicly available datasets.
arXiv Detail & Related papers (2021-08-22T13:21:58Z)
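For the scene-graph-based RAG entry above (Enhanced Multimodal RAG-LLM), the listing does not specify how graphs enter the prompt; the following is one plausible, purely illustrative serialization, with all names invented for the sketch.

```python
# Hypothetical sketch of serializing a scene graph into RAG prompt
# context, one plausible reading of the scene-graph RAG entry above;
# the paper's actual graph format and prompt template are not given here.
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    relation: str
    obj: str

def graph_to_context(triples: list[Triple]) -> str:
    """Flatten (subject, relation, object) triples into prompt lines."""
    return "\n".join(f"{t.subject} {t.relation} {t.obj}" for t in triples)

scene = [Triple("car", "parked on", "road"),
         Triple("road", "runs along", "river")]
prompt = ("Scene facts:\n" + graph_to_context(scene) +
          "\n\nQuestion: What runs along the river?")
```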
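For the two-stage emotion pipeline above (VLLMs Provide Better Context for Emotion Understanding), here is a schematic of the stage boundary only; the VLLM and classifier interfaces below are invented placeholders, not the paper's API.

```python
# Schematic of the two-stage pipeline summarized in the entry above.
# The interfaces are invented placeholders, not the paper's actual API.
from typing import Protocol

class VLLM(Protocol):
    def generate(self, image: bytes, prompt: str) -> str: ...

class EmotionClassifier(Protocol):
    def predict(self, image: bytes, text: str) -> str: ...

def classify_emotion(vllm: VLLM, clf: EmotionClassifier, image: bytes) -> str:
    # Stage 1: prompt the VLLM for a natural-language description of the
    # subject's apparent emotion and surrounding context.
    description = vllm.generate(
        image=image,
        prompt="Describe the apparent emotion of the person and the "
               "context around them.")
    # Stage 2: a transformer-based classifier consumes the image together
    # with the generated description as contextual input.
    return clf.predict(image=image, text=description)
```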
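For the annotation-free Ground Remote Alignment entry above, here is a minimal numerical sketch of one plausible objective, assuming precomputed embeddings for co-located satellite/ground pairs; this is an assumed reading, not the paper's exact loss.

```python
# Assumed reading of the ground-remote alignment idea: train a satellite
# encoder so each satellite embedding matches the embedding of a ground
# photo taken at the same location (one-directional InfoNCE shown).
import numpy as np

def info_nce(sat_emb: np.ndarray, ground_emb: np.ndarray,
             temperature: float = 0.07) -> float:
    """Batch loss: row i of sat_emb should match row i of ground_emb."""
    sat = sat_emb / np.linalg.norm(sat_emb, axis=1, keepdims=True)
    gnd = ground_emb / np.linalg.norm(ground_emb, axis=1, keepdims=True)
    logits = sat @ gnd.T / temperature              # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # diagonal = positives
```

Because ground photos already align well with language (e.g., via CLIP-style models), matching satellite views to them can transfer language supervision without satellite-specific captions, which is the intermediary role the entry describes.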