Visual Question Answering in Remote Sensing with Cross-Attention and
Multimodal Information Bottleneck
- URL: http://arxiv.org/abs/2306.14264v1
- Date: Sun, 25 Jun 2023 15:09:21 GMT
- Title: Visual Question Answering in Remote Sensing with Cross-Attention and
Multimodal Information Bottleneck
- Authors: Jayesh Songara, Shivam Pande, Shabnam Choudhury, Biplab Banerjee and
Rajbabu Velmurugan
- Abstract summary: We deal with the problem of visual question answering (VQA) in remote sensing.
While remotely sensed images contain information significant for the task of identification and object detection, they pose a great challenge in their processing because of high dimensionality, volume and redundancy.
We propose a cross-attention based approach combined with information maximization. The CNN-LSTM based cross-attention highlights the information in the image and language modalities and establishes a connection between the two, while information maximization learns a low-dimensional bottleneck layer that has all the relevant information required to carry out the VQA task.
- Score: 14.719648367178259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this research, we deal with the problem of visual question answering (VQA)
in remote sensing. While remotely sensed images contain information significant
for the task of identification and object detection, they pose a great
challenge in their processing because of high dimensionality, volume and
redundancy. Furthermore, processing image information jointly with language
features adds additional constraints, such as mapping the corresponding image
and language features. To handle this problem, we propose a cross-attention
based approach combined with information maximization. The CNN-LSTM based
cross-attention highlights the information in the image and language modalities
and establishes a connection between the two, while information maximization
learns a low-dimensional bottleneck layer that has all the relevant
information required to carry out the VQA task. We evaluate our method on two
VQA remote sensing datasets of different resolutions. For the high resolution
dataset, we achieve an overall accuracy of 79.11% and 73.87% for the two test
sets while for the low resolution dataset, we achieve an overall accuracy of
85.98%.
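The abstract describes the pipeline only in prose. The following minimal PyTorch sketch shows how a CNN image encoder, an LSTM question encoder, cross-attention between the two, and a low-dimensional bottleneck layer could fit together. Every module name, layer size, and the answer head are illustrative assumptions rather than the authors' implementation, and the information-maximization objective on the bottleneck is omitted.

```python
import torch
import torch.nn as nn

class CrossAttentionVQA(nn.Module):
    """Sketch: CNN image encoder + LSTM question encoder fused by
    cross-attention, compressed through a low-dimensional bottleneck."""

    def __init__(self, vocab_size=1000, num_answers=9, dim=256, bottleneck=64):
        super().__init__()
        # CNN encoder: image -> 8x8 grid of visual feature vectors
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # LSTM encoder: question tokens -> sequence of language features
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        # Cross-attention: language queries attend over image regions
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Bottleneck: low-dimensional layer meant to retain only the
        # information relevant to answering (the IB objective is omitted)
        self.bottleneck = nn.Linear(dim, bottleneck)
        self.classifier = nn.Linear(bottleneck, num_answers)

    def forward(self, image, question):
        v = self.cnn(image).flatten(2).transpose(1, 2)  # (B, 64, D) regions
        q, _ = self.lstm(self.embed(question))          # (B, T, D)
        fused, _ = self.cross_attn(q, v, v)             # (B, T, D)
        z = self.bottleneck(fused.mean(dim=1))          # (B, bottleneck)
        return self.classifier(z)                       # answer logits

model = CrossAttentionVQA()
logits = model(torch.randn(2, 3, 256, 256), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 9])
```

In a full information-bottleneck treatment, the bottleneck activations would additionally be trained to maximize mutual information with the answer while compressing the inputs; here the linear projection only marks where that layer sits.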
Related papers
- PGNeXt: High-Resolution Salient Object Detection via Pyramid Grafting Network [24.54269823691119]
We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives.
To compensate for the lack of HRSOD datasets, we thoughtfully collect a large-scale high-resolution salient object detection dataset, called UHRSD.
All the images are finely annotated at pixel level, far exceeding previous low-resolution SOD datasets.
arXiv Detail & Related papers (2024-08-02T09:31:21Z)
- Segmentation-guided Attention for Visual Question Answering from Remote Sensing Images [1.6932802756478726]
Visual Question Answering for Remote Sensing (RSVQA) is a task that aims at answering natural language questions about the content of a remote sensing image.
We propose to embed an attention mechanism guided by segmentation into an RSVQA pipeline (see the sketch after this entry).
We provide a new VQA dataset that exploits very high-resolution RGB orthophotos annotated with 16 segmentation classes and question/answer pairs.
arXiv Detail & Related papers (2024-07-11T16:59:32Z)
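As a rough illustration of the segmentation-guided attention idea in the entry above: a segmentation head produces per-pixel class scores, and the resulting spatial prior re-weights the visual features before the question attends to them. The 16-class head mirrors the dataset's annotation; all other names, shapes, and the max-probability prior are hypothetical, not the paper's method.

```python
import torch
import torch.nn as nn

class SegGuidedAttention(nn.Module):
    def __init__(self, feat_dim=256, num_classes=16):
        super().__init__()
        # 1x1 head predicting per-pixel segmentation logits
        self.seg_head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, feat_map, question_feat):
        # feat_map: (B, D, H, W); question_feat: (B, 1, D)
        seg = self.seg_head(feat_map)                         # (B, 16, H, W)
        # spatial prior = max class probability per pixel
        prior = seg.softmax(dim=1).amax(dim=1, keepdim=True)  # (B, 1, H, W)
        guided = feat_map * prior                 # emphasize segmented regions
        tokens = guided.flatten(2).transpose(1, 2)            # (B, H*W, D)
        fused, _ = self.attn(question_feat, tokens, tokens)
        return fused.squeeze(1), seg              # fused feature + seg logits

mod = SegGuidedAttention()
f, s = mod(torch.randn(2, 256, 16, 16), torch.randn(2, 1, 256))
print(f.shape, s.shape)  # torch.Size([2, 256]) torch.Size([2, 16, 16, 16])
```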
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Object Detection in Hyperspectral Image via Unified Spectral-Spatial Feature Aggregation [55.9217962930169]
We present S2ADet, an object detector that harnesses the rich spectral and spatial complementary information inherent in hyperspectral images.
S2ADet surpasses existing state-of-the-art methods, achieving robust and reliable results.
arXiv Detail & Related papers (2023-06-14T09:01:50Z)
- Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper pursues the holistic goal of maintaining spatially precise high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details (see the sketch after this entry).
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z)
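A minimal sketch of the multi-scale design described in the entry above: a full-resolution stream runs through the block without losing spatial size, while downsampled branches contribute contextual information that is upsampled and fused back in. This is a simplified stand-in under assumed names and sizes, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.full = nn.Conv2d(channels, channels, 3, padding=1)     # high-res stream
        self.half = nn.Conv2d(channels, channels, 3, padding=1)     # 1/2-res context
        self.quarter = nn.Conv2d(channels, channels, 3, padding=1)  # 1/4-res context
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        f = torch.relu(self.full(x))
        # contextual branches computed at reduced resolution
        c2 = torch.relu(self.half(F.avg_pool2d(x, 2)))
        c4 = torch.relu(self.quarter(F.avg_pool2d(x, 4)))
        # upsample context back and fuse; spatial detail is preserved in `f`
        c2 = F.interpolate(c2, size=(h, w), mode='bilinear', align_corners=False)
        c4 = F.interpolate(c4, size=(h, w), mode='bilinear', align_corners=False)
        return x + self.fuse(torch.cat([f, c2, c4], dim=1))  # residual fusion

block = MultiScaleFusionBlock()
print(block(torch.randn(1, 64, 128, 128)).shape)  # torch.Size([1, 64, 128, 128])
```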
- LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [52.63874513999119]
Cross-resolution image alignment is a key problem in multiscale gigapixel photography.
Existing deep homography methods neglect the explicit formulation of correspondences between the two inputs, which leads to degraded accuracy in cross-resolution challenges.
We propose a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs (see the sketch after this entry).
arXiv Detail & Related papers (2021-06-08T02:51:45Z)
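The correspondence idea in the LocalTrans entry above can be illustrated with plain cross-attention: tokens from the low-resolution image query tokens from the high-resolution image, so correspondences are modeled explicitly before a homography head. The multiscale embedding and local attention windows of the actual paper are omitted; every identifier here is an assumption.

```python
import torch
import torch.nn as nn

class CrossResCorrespondence(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, 3, stride=4, padding=1)  # shared patch encoder
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # regress 4 corner offsets -> 8-parameter homography parameterization
        self.head = nn.Linear(dim, 8)

    def tokens(self, img):
        f = self.encoder(img)                 # (B, D, H/4, W/4)
        return f.flatten(2).transpose(1, 2)   # (B, N, D)

    def forward(self, low_res, high_res):
        q = self.tokens(low_res)
        k = self.tokens(high_res)
        matched, attn = self.cross(q, k, k)   # attn: soft correspondences (B, Nq, Nk)
        return self.head(matched.mean(dim=1)), attn

net = CrossResCorrespondence()
offsets, attn = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 128, 128))
print(offsets.shape, attn.shape)  # torch.Size([1, 8]) torch.Size([1, 256, 1024])
```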
- TJU-DHD: A Diverse High-Resolution Dataset for Object Detection [48.94731638729273]
Large-scale, rich-diversity, and high-resolution datasets play an important role in developing better object detection methods.
We build a diverse high-resolution dataset (called TJU-DHD).
The dataset contains 115,354 high-resolution images and 709,330 labeled objects with a large variance in scale and appearance.
arXiv Detail & Related papers (2020-11-18T09:32:24Z)
- Multi-image Super Resolution of Remotely Sensed Images using Residual Feature Attention Deep Neural Networks [1.3764085113103222]
The presented research proposes a novel residual attention model (RAMS) that efficiently tackles the multi-image super-resolution task.
We introduce a visual feature attention mechanism with 3D convolutions to obtain attention-aware data fusion and information extraction (see the sketch after this entry).
Our representation learning network makes extensive use of nestled residual connections to let redundant low-frequency signals flow through the network.
arXiv Detail & Related papers (2020-07-06T22:54:02Z)
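A toy sketch of 3D-convolutional feature attention for multi-image super-resolution, per the entry above: the stack of low-resolution acquisitions is treated as a (depth, height, width) volume, and a sigmoid-gated 3D branch weights the features before fusion. Shapes and names are illustrative assumptions, not the RAMS implementation.

```python
import torch
import torch.nn as nn

class FeatureAttention3D(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.feat = nn.Conv3d(1, channels, kernel_size=3, padding=1)
        # attention branch: sigmoid-gate the 3D features
        self.gate = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv3d(channels, 1, kernel_size=3, padding=1)

    def forward(self, stack):
        # stack: (B, T, H, W) -> add a channel axis for 3D convs
        x = stack.unsqueeze(1)           # (B, 1, T, H, W)
        f = torch.relu(self.feat(x))
        f = f * self.gate(f)             # attention-weighted features of the stack
        return self.fuse(f).mean(dim=2).squeeze(1)  # (B, H, W) fused frame

net = FeatureAttention3D()
print(net(torch.randn(2, 9, 32, 32)).shape)  # torch.Size([2, 32, 32])
```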
- RSVQA: Visual Question Answering for Remote Sensing Data [6.473307489370171]
This paper introduces the task of visual question answering for remote sensing data (RSVQA).
We use questions formulated in natural language to interact with the images.
The datasets can be used to train (when using supervised methods) and evaluate models to solve the RSVQA task.
arXiv Detail & Related papers (2020-03-16T17:09:31Z)
- Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for the image restoration task.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)