RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering
- URL: http://arxiv.org/abs/2508.07918v1
- Date: Mon, 11 Aug 2025 12:32:48 GMT
- Title: RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering
- Authors: Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad
- Abstract summary: This paper introduces the RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks.
- Score: 5.8161272823734205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces the RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations, including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
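The counting track of the pipeline is concrete enough to sketch. Below is a minimal illustration, assuming the segmentation data is available as per-pixel class masks; the function names, question templates, and answer phrasing are hypothetical stand-ins, since the abstract does not publish the actual pipeline, and in RSVLM-QA the final answer wording comes from GPT-4.1 rather than a fixed string.

```python
# Hypothetical sketch of the counting-QA track: count object instances in a
# segmentation mask, then pair the counts with preset question templates.
import numpy as np
from scipy import ndimage

# Preset question templates (illustrative; not the paper's actual templates).
QUESTION_TEMPLATES = [
    "How many {cls} are there in this image?",
    "What is the number of {cls} visible in the scene?",
]

def count_instances(mask: np.ndarray, class_id: int) -> int:
    """Count instances of one class as connected components of its binary mask."""
    binary = (mask == class_id)
    _, n_components = ndimage.label(binary)
    return n_components

def make_counting_qa(mask: np.ndarray, class_names: dict[int, str]) -> list[dict]:
    """Pair per-class counts with templates to form counting QA pairs.

    RSVLM-QA has GPT-4.1 rephrase each raw count into a natural language
    answer; this sketch emits a plain formatted string instead.
    """
    qa_pairs = []
    for class_id, name in class_names.items():
        count = count_instances(mask, class_id)
        for template in QUESTION_TEMPLATES:
            qa_pairs.append({
                "question": template.format(cls=name),
                "answer": f"There are {count} {name} in the image.",
            })
    return qa_pairs

# Usage: a toy mask with two disjoint "buildings" blobs (class id 1).
mask = np.zeros((8, 8), dtype=int)
mask[1:3, 1:3] = 1
mask[5:7, 5:7] = 1
print(make_counting_qa(mask, {1: "buildings"}))
```

Connected-component counting is only one plausible reading: for semantic masks (e.g. LoveDA), touching instances would merge into one component, whereas instance-level annotations such as iSAID's yield counts directly.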
Related papers
- VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering [53.662676566188175]
A key bottleneck lies in the lack of public, large-scale, high-quality Scientific Visual Question Answering (SVQA) datasets. We propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types.
arXiv Detail & Related papers (2025-11-25T04:14:52Z) - ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering [2.6309739988261795]
We propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework.
arXiv Detail & Related papers (2025-07-22T09:55:09Z) - AMAQA: A Metadata-based QA Dataset for RAG Systems [7.882922366782987]
We present AMAQA, a new open-access QA dataset designed to evaluate tasks combining text and metadata. AMAQA includes about 1.1 million English messages collected from 26 public Telegram groups. We show that leveraging metadata boosts accuracy from 0.12 to 0.61, highlighting the value of structured context.
arXiv Detail & Related papers (2025-05-19T08:59:08Z) - Copy-Move Forgery Detection and Question Answering for Remote Sensing Image [24.984627968280872]
This paper introduces the Remote Sensing Copy-Move Question Answering (RSCMQA) task. Unlike traditional Remote Sensing Visual Question Answering (RSVQA), RSCMQA focuses on interpreting complex tampering scenarios. We present a suite of global RSCMQA datasets, comprising images from 29 different regions across 14 countries.
arXiv Detail & Related papers (2024-12-03T17:02:40Z) - Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA.
arXiv Detail & Related papers (2023-09-29T10:54:10Z) - RSGPT: A Remote Sensing Vision Language Model and Benchmark [7.279747655485913]
We build a high-quality Remote Sensing Image Captioning dataset (RSICap).
This dataset comprises 2,585 human-annotated captions with rich and high-quality information.
We also provide a benchmark evaluation dataset called RSIEval.
arXiv Detail & Related papers (2023-07-28T02:23:35Z) - LMGQS: A Large-scale Dataset for Query-focused Summarization [77.6179359525065]
We convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS.
We establish baselines with state-of-the-art summarization models.
We achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.
arXiv Detail & Related papers (2023-05-22T14:53:45Z) - Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z) - Analysis on Image Set Visual Question Answering [0.3359875577705538]
We tackle the challenge of Visual Question Answering in a multi-image setting.
Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image.
In this report, we work with 4 approaches in a bid to improve the performance on the task.
arXiv Detail & Related papers (2021-03-31T20:47:32Z) - Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms SoTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset, among all models except the pre-training based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z) - Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)