From Easy to Hard: Learning Language-guided Curriculum for Visual
Question Answering on Remote Sensing Data
- URL: http://arxiv.org/abs/2205.03147v1
- Date: Fri, 6 May 2022 11:37:00 GMT
- Authors: Zhenghang Yuan, Lichao Mou, Qi Wang, and Xiao Xiang Zhu
- Abstract summary: Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations.
There are questions with clearly different difficulty levels for each image in the RSVQA task.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
- Score: 27.160303686163164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual question answering (VQA) for remote sensing scenes has great
potential in intelligent human-computer interaction systems. Although VQA in
computer vision has been widely researched, VQA for remote sensing data (RSVQA)
is still in its infancy. Two characteristics need to be specially considered
for the RSVQA task. 1) No object annotations are available in RSVQA datasets,
which makes it difficult for models to exploit informative region
representations; 2) the questions posed for each image have clearly different
difficulty levels. Directly training a model with questions in a random order
may confuse the model and limit its performance. To address these two problems,
this paper proposes a multi-level visual feature learning method that jointly
extracts language-guided holistic and regional image features. In addition, a
self-paced curriculum learning (SPCL)-based VQA model is developed to train
networks on samples in an easy-to-hard order. More specifically, a
language-guided SPCL method with a soft weighting strategy is explored in this
work. The proposed model is evaluated on three public datasets, and extensive
experimental results show that the proposed RSVQA framework achieves promising
performance.
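The easy-to-hard training described above can be illustrated with a minimal sketch of self-paced learning with a linear soft-weighting scheme. This is an illustrative assumption of how such weighting commonly works, not the authors' implementation; the paper's exact weighting function and its language-guided variant may differ.

```python
# Sketch of self-paced curriculum learning (SPCL) with a soft
# (linear) weighting scheme. Easy samples (low loss) receive weights
# near 1; samples harder than the age parameter `lam` receive 0.
# Growing `lam` each epoch admits harder samples into training.
# All names and values here are hypothetical.

def soft_weights(losses, lam):
    """Linear soft weights: w_i = max(0, 1 - l_i / lam)."""
    return [max(0.0, 1.0 - l / lam) for l in losses]

def grow_age(lam, mu=1.3):
    """Relax the curriculum so harder samples enter later epochs."""
    return lam * mu

losses = [0.2, 0.9, 1.5, 3.0]   # hypothetical per-question VQA losses
lam = 1.0
w_early = soft_weights(losses, lam)      # hard questions weighted 0
lam = grow_age(lam)
w_later = soft_weights(losses, lam)      # hard questions gain weight
print(w_early, w_later)
```

As `lam` grows, previously excluded hard questions receive increasing weight, producing the easy-to-hard ordering without ever discarding samples outright.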
Related papers
- Large Vision-Language Models for Remote Sensing Visual Question Answering [0.0]
Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions.
Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions.
We propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process.
arXiv Detail & Related papers (2024-11-16T18:32:38Z)
- Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images [19.99615698375829]
We propose a contrastive learning strategy for training robust RSVQA models against diverse question templates and words.
Experimental results demonstrate that the proposed augmented dataset is effective in improving the robustness of the RSVQA model.
arXiv Detail & Related papers (2023-04-07T21:06:58Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This combines pseudo UQs obtained by randomly pairing images and questions, with an
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- How to find a good image-text embedding for remote sensing visual question answering? [41.0510495281302]
Visual question answering (VQA) has been introduced to remote sensing to make information extraction from overhead imagery more accessible to everyone.
We study three different fusion methodologies in the context of VQA for remote sensing and analyse the gains in accuracy with respect to the model complexity.
arXiv Detail & Related papers (2021-09-24T09:48:28Z)
- RSVQA: Visual Question Answering for Remote Sensing Data [6.473307489370171]
This paper introduces the task of visual question answering for remote sensing data (RSVQA).
Questions formulated in natural language are used to interact with the images.
The datasets can be used to train (when using supervised methods) and evaluate models to solve the RSVQA task.
arXiv Detail & Related papers (2020-03-16T17:09:31Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
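The question-side masking described above can be sketched minimally. Here, which words count as "critical" is a toy assumption (a fixed keyword set); the CSS paper derives criticality from model attributions, and the names below are hypothetical.

```python
# Sketch of counterfactual question synthesis in the spirit of CSS:
# mask critical words in a question to form a counterfactual sample.
# The fixed CRITICAL set is a stand-in for attribution-based word
# selection used in the actual method.

CRITICAL = {"color", "many", "sport"}  # hypothetical critical words

def mask_question(question, critical=CRITICAL, token="[MASK]"):
    """Replace critical words with a mask token, keeping other words."""
    return " ".join(
        token if w.lower().strip("?") in critical else w
        for w in question.split()
    )

print(mask_question("What color is the plane?"))
# -> "What [MASK] is the plane?"
```

Training the model on both the original and the masked question, with appropriately adjusted supervision, discourages it from exploiting superficial language shortcuts.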
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
- In Defense of Grid Features for Visual Question Answering [65.71985794097426]
We revisit grid features for visual question answering (VQA) and find they can work surprisingly well.
We verify that this observation holds true across different VQA models and generalizes well to other tasks like image captioning.
We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training.
arXiv Detail & Related papers (2020-01-10T18:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.