From Easy to Hard: Learning Language-guided Curriculum for Visual
Question Answering on Remote Sensing Data
- URL: http://arxiv.org/abs/2205.03147v1
- Date: Fri, 6 May 2022 11:37:00 GMT
- Title: From Easy to Hard: Learning Language-guided Curriculum for Visual
Question Answering on Remote Sensing Data
- Authors: Zhenghang Yuan, Lichao Mou, Qi Wang, and Xiao Xiang Zhu
- Abstract summary: Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations.
There are questions with clearly different difficulty levels for each image in the RSVQA task.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
- Score: 27.160303686163164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual question answering (VQA) for remote sensing scenes has great potential
in intelligent human-computer interaction systems. Although VQA in computer
vision has been widely researched, VQA for remote sensing data (RSVQA) is still
in its infancy. Two characteristics need to be specially considered for the
RSVQA task. 1) No object annotations are available in RSVQA datasets, which
makes it difficult for models to exploit informative region representations;
2) the questions posed for each image span clearly different difficulty levels.
Directly training a model with questions in a random order may confuse the
model and limit its performance. To address these two problems, this paper
proposes a multi-level visual feature learning method that jointly extracts
language-guided holistic and regional image features. In addition, a self-paced
curriculum learning (SPCL)-based VQA model is developed to train the network on
samples in an easy-to-hard order. More specifically, a language-guided SPCL
method with a soft weighting strategy is explored in this work. The proposed
model is evaluated on three public datasets, and extensive experimental results
show that the proposed RSVQA framework achieves promising performance.
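To make the training idea concrete, below is a minimal sketch of soft-weighted self-paced learning, assuming a per-sample loss and a pace parameter `lam` that grows during training. It illustrates the generic soft-weighting rule rather than the paper's exact language-guided difficulty measure; the helper names (`soft_spl_weights`, `spcl_training_step`) are hypothetical.

```python
import torch

def soft_spl_weights(losses: torch.Tensor, lam: float) -> torch.Tensor:
    """Linear soft self-paced weights: easy samples (low loss) get weights
    close to 1, samples with loss >= lam are ignored (weight 0).

    Generic rule v_i = max(0, 1 - loss_i / lam); the paper's language-guided
    variant additionally conditions the pace on question difficulty, which is
    not modeled in this sketch."""
    return torch.clamp(1.0 - losses / lam, min=0.0)

def spcl_training_step(model, batch, criterion, optimizer, lam):
    """One hypothetical training step with self-paced sample weighting.
    `criterion` is assumed to return per-sample losses (reduction='none')."""
    images, questions, answers = batch
    logits = model(images, questions)
    per_sample_loss = criterion(logits, answers)          # shape: (batch,)
    with torch.no_grad():
        weights = soft_spl_weights(per_sample_loss, lam)  # easy-to-hard weights
    loss = (weights * per_sample_loss).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The pace parameter lam is typically increased every epoch (e.g. lam *= 1.1),
# so that harder samples are gradually admitted into training.
```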
Related papers
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
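As a rough illustration only, the snippet below assembles a hypothetical knowledge-enriched prompt from a caption, a rationale, and scene-graph triples; the actual prompt format and guidance sources used in the paper may differ, and `build_lg_prompt` is an invented helper.

```python
def build_lg_prompt(question: str, caption: str, rationale: str,
                    scene_graph: list[tuple[str, str, str]]) -> str:
    """Assemble a hypothetical knowledge-enriched prompt for a multimodal LM.
    The language guidance (caption, rationale, scene-graph triples) supplies
    context that is not directly visible in the image features."""
    sg_text = "; ".join(f"{s} {p} {o}" for s, p, o in scene_graph)
    return (
        f"Image caption: {caption}\n"
        f"Scene graph: {sg_text}\n"
        f"Rationale: {rationale}\n"
        f"Question: {question}\n"
        f"Answer:"
    )

prompt = build_lg_prompt(
    question="What is the person holding?",
    caption="A man stands on a beach holding a surfboard.",
    rationale="Surfboards are commonly carried at beaches.",
    scene_graph=[("man", "holding", "surfboard"), ("man", "on", "beach")],
)
```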
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
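As a simplified illustration of template-based generation from detection labels, the snippet below builds counting question-answer pairs with a single invented template; NuScenes-QA's real templates and scene-graph schema are richer.

```python
from collections import Counter

def generate_counting_qa(detections: list[dict]) -> list[tuple[str, str]]:
    """Generate counting question-answer pairs from detection labels using a
    simple template. `detections` is a list of dicts with a 'category' key
    (a hypothetical annotation schema)."""
    counts = Counter(d["category"] for d in detections)
    template = "How many {category}s are there in the scene?"
    return [(template.format(category=cat), str(n)) for cat, n in counts.items()]

# Example with made-up detections for one scene:
dets = [{"category": "car"}, {"category": "car"}, {"category": "pedestrian"}]
print(generate_counting_qa(dets))
# [('How many cars are there in the scene?', '2'),
#  ('How many pedestrians are there in the scene?', '1')]
```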
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images [19.99615698375829]
We propose a contrastive learning strategy for training robust RSVQA models against diverse question templates and words.
Experimental results demonstrate that the proposed augmented dataset is effective in improving the robustness of the RSVQA model.
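A minimal sketch of such a contrastive objective, assuming paired embeddings for an original question and its translated or paraphrased variant, is shown below; it uses a standard InfoNCE-style loss rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def question_contrastive_loss(z_orig: torch.Tensor,
                              z_aug: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling each question embedding towards the
    embedding of its paraphrased/translated version and pushing it away
    from the other questions in the batch. Shapes: (batch, dim)."""
    z_orig = F.normalize(z_orig, dim=-1)
    z_aug = F.normalize(z_aug, dim=-1)
    logits = z_orig @ z_aug.t() / temperature                  # (batch, batch)
    targets = torch.arange(z_orig.size(0), device=z_orig.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)
```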
arXiv Detail & Related papers (2023-04-07T21:06:58Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out 2 drawbacks in current RVQA research, where (1) datasets contain too many unchallenging UQs and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
For training, pseudo UQs are obtained by randomly pairing images and questions.
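The pseudo-UQ construction can be sketched as follows, assuming a simple list-of-dicts dataset schema (image_id, question) invented for this example; the rest of the training pipeline is omitted.

```python
import random

def make_pseudo_uqs(samples: list[dict], num_pairs: int, seed: int = 0) -> list[dict]:
    """Create pseudo unanswerable questions (UQs) by pairing each sampled
    question with an image it was not asked about. `samples` is a list of
    dicts with 'image_id' and 'question' keys (a hypothetical schema)."""
    rng = random.Random(seed)
    pseudo = []
    while len(pseudo) < num_pairs:
        a, b = rng.sample(samples, 2)
        if a["image_id"] != b["image_id"]:   # question comes from a different image
            pseudo.append({"image_id": a["image_id"],
                           "question": b["question"],
                           "answerable": False})
    return pseudo
```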
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets [5.45761450227064]
We propose a new Few-Shot Visual Question Generation (FS-VQG) task and provide a comprehensive benchmark for it.
We evaluate various existing VQG approaches as well as popular few-shot solutions based on meta-learning and self-supervised strategies for the FS-VQG task.
Several important findings emerge from our experiments that shed light on the limits of current models in few-shot vision and language generation tasks.
arXiv Detail & Related papers (2022-10-13T15:01:15Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
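As a loose illustration of alignment at more than one granularity, the sketch below scores word-to-region and sentence-to-image similarity separately; it is not the MGA-VQA architecture, just the general idea under assumed feature shapes.

```python
import torch
import torch.nn.functional as F

def multi_granularity_alignment(word_feats, region_feats, sent_feat, img_feat):
    """Hypothetical two-level alignment: fine-grained word-region similarity
    plus coarse sentence-image similarity. Assumed shapes:
      word_feats (num_words, d), region_feats (num_regions, d)
      sent_feat  (d,),           img_feat     (d,)
    Returns a (fine, coarse) pair of alignment scores."""
    w = F.normalize(word_feats, dim=-1)
    r = F.normalize(region_feats, dim=-1)
    fine = (w @ r.t()).max(dim=1).values.mean()        # best-matching region per word
    coarse = F.cosine_similarity(sent_feat, img_feat, dim=0)
    return fine, coarse
```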
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- How to find a good image-text embedding for remote sensing visual question answering? [41.0510495281302]
Visual question answering (VQA) has been introduced to remote sensing to make information extraction from overhead imagery more accessible to everyone.
We study three different fusion methodologies in the context of VQA for remote sensing and analyse the gains in accuracy with respect to the model complexity.
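For intuition, the sketch below implements a few common image-text fusion operators (concatenation, element-wise product, sum); these are generic examples and not necessarily the three fusion methodologies compared in the paper.

```python
import torch

def fuse(img_feat: torch.Tensor, txt_feat: torch.Tensor, mode: str) -> torch.Tensor:
    """Simple image-text fusion operators often compared in VQA studies.
    Both inputs are (batch, dim) and assumed to share the same dimension."""
    if mode == "concat":
        return torch.cat([img_feat, txt_feat], dim=-1)   # (batch, 2*dim)
    if mode == "product":
        return img_feat * txt_feat                        # element-wise (Hadamard)
    if mode == "sum":
        return img_feat + txt_feat
    raise ValueError(f"unknown fusion mode: {mode}")
```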
arXiv Detail & Related papers (2021-09-24T09:48:28Z)
- RSVQA: Visual Question Answering for Remote Sensing Data [6.473307489370171]
This paper introduces the task of visual question answering for remote sensing data (RSVQA).
Questions formulated in natural language are used to interact with the images.
The datasets can be used to train (when using supervised methods) and evaluate models to solve the RSVQA task.
arXiv Detail & Related papers (2020-03-16T17:09:31Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
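A rough sketch of the question-side idea, masking the words judged most critical, is given below; in the actual method the importance scores come from model attributions, whereas `word_scores` here is a made-up input.

```python
def mask_critical_words(question: str, word_scores: dict, k: int = 1,
                        mask_token: str = "[MASK]") -> str:
    """Build a counterfactual question by masking the k words with the
    highest importance scores. `word_scores` maps each word to a
    (hypothetical) attribution score from the VQA model."""
    words = question.split()
    topk = sorted(words, key=lambda w: word_scores.get(w, 0.0), reverse=True)[:k]
    return " ".join(mask_token if w in topk else w for w in words)

print(mask_critical_words("what color is the umbrella",
                          {"color": 0.9, "umbrella": 0.8}))
# "what [MASK] is the umbrella"
```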
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
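As an illustration of the grid-feature idea, the sketch below flattens the spatial output of a torchvision ResNet-50 into a set of feature vectors without any region proposals; the backbone choice and shapes are assumptions for the example, not the paper's exact setup.

```python
import torch
from torchvision.models import resnet50

def extract_grid_features(images: torch.Tensor) -> torch.Tensor:
    """Extract grid features from a CNN backbone: the spatial map of a
    ResNet-50 is flattened into a set of feature vectors, with no region
    proposals or object annotations involved."""
    backbone = resnet50(weights=None)
    layers = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc
    with torch.no_grad():
        fmap = layers(images)                       # (B, 2048, H/32, W/32)
    b, c, h, w = fmap.shape
    return fmap.view(b, c, h * w).permute(0, 2, 1)  # (B, h*w, 2048) grid tokens

feats = extract_grid_features(torch.randn(1, 3, 224, 224))
print(feats.shape)   # torch.Size([1, 49, 2048])
```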
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.