Related papers: HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images

URL: http://arxiv.org/abs/2301.09460v1
Date: Mon, 23 Jan 2023 14:36:38 GMT
Title: HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images
Authors: Kun Li, George Vosselman, Michael Ying Yang
Abstract summary: We introduce a new dataset, HRVQA, which provides 53512 collected aerial images of 1024*1024 pixels and 1070240 QA pairs. To benchmark the understanding capability of VQA models for aerial images, we evaluate the relevant methods on HRVQA. Our method achieves superior performance in comparison to the previous state-of-the-art approaches.
Score: 18.075338835513993
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual question answering (VQA) is an important and challenging multimodal task in computer vision. Recently, a few efforts have been made to bring the VQA task to aerial images, due to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, both the huge variation in the appearance, scale, and orientation of the concepts in aerial images and the scarcity of well-annotated datasets restrict the development of VQA in this domain. In this paper, we introduce a new dataset, HRVQA, which provides 53512 collected aerial images of 1024*1024 pixels and 1070240 semi-automatically generated QA pairs. To benchmark the understanding capability of VQA models for aerial images, we evaluate the relevant methods on HRVQA. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion module. The experiments show that the proposed dataset is quite challenging, especially for questions about specific attributes. Our method achieves superior performance in comparison to the previous state-of-the-art approaches. The dataset and the source code will be released at https://hrvqa.nl/.

Related papers

SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset [11.729464930866483]
"SimpsonsVQA" is a novel dataset for VQA derived from The Simpsons TV show. It is designed to address not only the traditional VQA task but also to identify irrelevant questions related to images. SimpsonsVQA contains approximately 23K images, 166K QA pairs, and 500K judgments.
arXiv Detail & Related papers (2024-10-30T02:30:40Z)
Descriptive Image Quality Assessment in the Wild [25.503311093471076]
VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression. We introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios.
arXiv Detail & Related papers (2024-05-29T07:49:15Z)
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA. We first augment the existing data via deliberate perturbations on either the image or question. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets. It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images. We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z)
NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenesQA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. We leverage existing 3D detection annotations to generate scene graphs and design question templates manually. We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery. We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model. We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese [2.7528170226206443]
We introduce the OpenViVQA dataset, the first large-scale dataset for visual question answering in Vietnamese. The dataset consists of 11,000+ images associated with 37,000+ question-answer pairs (QAs). Our proposed methods achieve results that are competitive with SOTA models such as SAAA, MCAN, LORA, and M4C.
arXiv Detail & Related papers (2023-05-07T03:59:31Z)
COIN: Counterfactual Image Generation for VQA Interpretation [5.994412766684842]
We introduce an interpretability approach for VQA models by generating counterfactual images. In addition to interpreting the results of VQA models on single images, the obtained results and the discussion provide an extensive explanation of VQA models' behaviour.
arXiv Detail & Related papers (2022-01-10T13:51:35Z)
Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
Visual Question Answering on 360° Images [96.00046925811515]
VQA 360° is a novel task of visual question answering on 360° images. We collect the first VQA 360° dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types.
arXiv Detail & Related papers (2020-01-10T08:18:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.