RSVQA: Visual Question Answering for Remote Sensing Data
- URL: http://arxiv.org/abs/2003.07333v2
- Date: Thu, 14 May 2020 14:05:28 GMT
- Title: RSVQA: Visual Question Answering for Remote Sensing Data
- Authors: Sylvain Lobry, Diego Marcos, Jesse Murray, Devis Tuia
- Abstract summary: This paper introduces the task of visual question answering for remote sensing data (RSVQA).
Questions formulated in natural language are used to interact with the images.
The datasets can be used to train (when using supervised methods) and evaluate models that solve the RSVQA task.
- Score: 6.473307489370171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the task of visual question answering for remote
sensing data (RSVQA). Remote sensing images contain a wealth of information
which can be useful for a wide range of tasks including land cover
classification, object counting or detection. However, most of the available
methodologies are task-specific, thus inhibiting generic and easy access to the
information contained in remote sensing data. As a consequence, accurate remote
sensing product generation still requires expert knowledge. With RSVQA, we
propose a system to extract information from remote sensing data that is
accessible to every user: we use questions formulated in natural language and
use them to interact with the images. With the system, images can be queried to
obtain high level information specific to the image content or relational
dependencies between objects visible in the images. Using an automatic method
introduced in this article, we built two datasets (using low and high
resolution data) of image/question/answer triplets. The information required to
build the questions and answers is queried from OpenStreetMap (OSM). The
datasets can be used to train (when using supervised methods) and evaluate
models to solve the RSVQA task. We report the results obtained by applying a
model based on Convolutional Neural Networks (CNNs) for the visual part and on
a Recurrent Neural Network (RNN) for the natural language part to this task.
The model is trained on the two datasets, yielding promising results in both
cases.
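The abstract describes a baseline that pairs a CNN image encoder with an RNN question encoder before classifying over a fixed set of answers. Below is a minimal sketch of such an architecture; the layer sizes, the element-wise fusion, and the class names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a CNN + RNN VQA baseline in the spirit of the RSVQA paper.
# Layer sizes and the point-wise fusion are assumptions, not the published model.
import torch
import torch.nn as nn

class RSVQABaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Visual branch: a small CNN encoder producing a fixed-size image feature.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        # Language branch: word embedding + RNN (LSTM) over the tokenized question.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fusion + classifier over a fixed answer vocabulary (supervised setting).
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image, question_tokens):
        v = self.cnn(image)                  # (B, hidden_dim) image feature
        q_emb = self.embed(question_tokens)  # (B, T, embed_dim)
        _, (h_n, _) = self.rnn(q_emb)        # final hidden state: (1, B, hidden_dim)
        q = h_n.squeeze(0)                   # (B, hidden_dim) question feature
        fused = v * q                        # element-wise fusion (assumption)
        return self.classifier(fused)        # answer logits

# Example: a batch of 2 RGB images (256x256) and questions of length 12.
model = RSVQABaseline(vocab_size=1000, num_answers=10)
logits = model(torch.randn(2, 3, 256, 256), torch.randint(1, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 10])
```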
Related papers
- A Comprehensive Survey on Visual Question Answering Datasets and Algorithms [1.941892373913038]
We meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category.
We explore six main paradigms of VQA models: fusion, attention, the technique of using information from one modality to filter information from another, external knowledge base, composition or reasoning, and graph models.
arXiv Detail & Related papers (2024-11-17T18:52:06Z)
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accept region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck [14.719648367178259]
We deal with the problem of visual question answering (VQA) in remote sensing.
While remotely sensed images contain information significant for identification and object detection tasks, they are challenging to process because of their high dimensionality, volume and redundancy.
We propose a cross-attention based approach combined with an information bottleneck (see the cross-attention sketch after this list). The CNN-LSTM based cross-attention highlights the information in the image and language modalities and establishes a connection between the two, while the information bottleneck learns a low-dimensional layer that retains all the relevant information required to carry out the VQA task.
arXiv Detail & Related papers (2023-06-25T15:09:21Z)
- AVIS: Autonomous Visual Information Seeking with Large Language Model Agent [123.75169211547149]
We propose an autonomous information seeking visual question answering framework, AVIS.
Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools.
AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
arXiv Detail & Related papers (2023-06-13T20:50:22Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data [27.160303686163164]
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations.
There are questions with clearly different difficulty levels for each image in the RSVQA task.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
arXiv Detail & Related papers (2022-05-06T11:37:00Z)
- How to find a good image-text embedding for remote sensing visual question answering? [41.0510495281302]
Visual question answering (VQA) has been introduced to remote sensing to make information extraction from overhead imagery more accessible to everyone.
We study three different fusion methodologies in the context of VQA for remote sensing and analyse the gains in accuracy with respect to the model complexity.
arXiv Detail & Related papers (2021-09-24T09:48:28Z)
- Learning Compositional Representation for Few-shot Visual Question Answering [93.4061107793983]
Current Visual Question Answering methods perform well on answers with a large amount of training data but have limited accuracy on novel answers with few examples.
We propose to extract the attributes from the answers with enough data, which are later composed to constrain the learning of the few-shot ones.
Experimental results on the VQA v2.0 validation dataset demonstrate the effectiveness of our proposed attribute network.
arXiv Detail & Related papers (2021-02-21T10:16:24Z)
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation.
We generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
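The cross-attention entry above (Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck) describes a CNN-LSTM model in which each modality attends to the other before a low-dimensional bottleneck layer. The sketch below shows one plausible form of the cross-attention fusion step; the multi-head formulation, the pooling, and the dimensions are assumptions rather than that paper's implementation.

```python
# Hedged sketch of a bidirectional cross-attention fusion for remote sensing VQA.
# The attention formulation and feature dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Question features attend over CNN grid features, and vice versa.
        self.q_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_to_q = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # joint projection (could feed a bottleneck)

    def forward(self, img_feats, q_feats):
        # img_feats: (B, N, dim) CNN grid features; q_feats: (B, T, dim) LSTM states.
        q_att, _ = self.q_to_img(q_feats, img_feats, img_feats)  # question attends to image
        i_att, _ = self.img_to_q(img_feats, q_feats, q_feats)    # image attends to question
        # Pool each attended sequence and concatenate into a joint representation.
        fused = torch.cat([q_att.mean(dim=1), i_att.mean(dim=1)], dim=-1)
        return self.proj(fused)                                  # (B, dim)

# Example: a 7x7 grid of image features and a 12-token question, batch of 2.
fusion = CrossAttentionFusion()
joint = fusion(torch.randn(2, 49, 512), torch.randn(2, 12, 512))
print(joint.shape)  # torch.Size([2, 512])
```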
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.