Multi-Modal Fusion Transformer for Visual Question Answering in Remote
Sensing
- URL: http://arxiv.org/abs/2210.04510v1
- Date: Mon, 10 Oct 2022 09:20:33 GMT
- Title: Multi-Modal Fusion Transformer for Visual Question Answering in Remote
Sensing
- Authors: Tim Siebert, Kai Norman Clasen, Mahdyar Ravanbakhsh, Begüm Demir
- Abstract summary: VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information.
Most of the current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning.
We propose a multi-modal transformer-based architecture to overcome this issue.
- Score: 1.491109220586182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the new generation of satellite technologies, the archives of remote
sensing (RS) images are growing very fast. To make the intrinsic information of
each RS image easily accessible, visual question answering (VQA) has been
introduced in RS. VQA allows a user to formulate a free-form question
concerning the content of RS images to extract generic information. It has been
shown that the fusion of the input modalities (i.e., image and text) is crucial
for the performance of VQA systems. Most of the current fusion approaches use
modality-specific representations in their fusion modules instead of joint
representation learning. However, to discover the underlying relation between
both the image and question modality, the model is required to learn the joint
representation instead of simply combining (e.g., concatenating, adding, or
multiplying) the modality-specific representations. We propose a multi-modal
transformer-based architecture to overcome this issue. Our proposed
architecture consists of three main modules: i) the feature extraction module
for extracting the modality-specific features; ii) the fusion module, which
leverages a user-defined number of multi-modal transformer layers of the
VisualBERT model (VB); and iii) the classification module to obtain the answer.
Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are
made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness of the
proposed architecture (VBFusion) for VQA tasks in RS. To analyze the importance of using other spectral
bands for the description of the complex content of RS images in the framework
of VQA, we extend the RSVQAxBEN dataset to include all the spectral bands of
Sentinel-2 images with 10m and 20m spatial resolution.
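
The distinction the abstract draws between simply combining modality-specific representations and learning a joint representation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 4-dimensional tokens, the text-then-image token order, and the identity query/key/value projections are all assumptions made for illustration (a real VisualBERT layer learns these projections and stacks several multi-head layers).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


def simple_fusion(img_vec, txt_vec, mode):
    """Combine two pooled, modality-specific vectors (hypothetical helper)."""
    if mode == "concat":
        return img_vec + txt_vec  # list concatenation
    if mode == "add":
        return [a + b for a, b in zip(img_vec, txt_vec)]
    if mode == "mul":
        return [a * b for a, b in zip(img_vec, txt_vec)]
    raise ValueError(f"unknown mode: {mode}")


def joint_self_attention(img_tokens, txt_tokens):
    """One single-head self-attention pass over the joint token sequence.

    Assumption: identity Q/K/V projections, so this only shows how every
    token attends to tokens of both modalities before any pooling.
    """
    tokens = txt_tokens + img_tokens  # illustrative order: text, then image
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


# Toy features standing in for the output of the feature extraction module.
img_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
txt_tokens = [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0]]

pooled_img = mean_pool(img_tokens)
pooled_txt = mean_pool(txt_tokens)

concat = simple_fusion(pooled_img, pooled_txt, "concat")
added = simple_fusion(pooled_img, pooled_txt, "add")
multiplied = simple_fusion(pooled_img, pooled_txt, "mul")

joint = joint_self_attention(img_tokens, txt_tokens)
joint_pooled = mean_pool(joint)  # would feed a classification head
```

In the simple variants the two modalities only meet after each has been pooled on its own, whereas in the joint variant every text token is already a mixture of image and text tokens (and vice versa) before pooling, which is the property the abstract attributes to the transformer-based fusion module.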
Related papers
- Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering [26.8129265632403]
Current Remote Sensing Visual Question Answering (RSVQA) methods are limited by the imaging mechanisms of optical sensors.
We propose a Text-guided Coarse-to-Fine Fusion Network (TGFNet) to improve RSVQA performance.
We create the first large-scale benchmark dataset for evaluating optical-SAR RSVQA methods.
arXiv Detail & Related papers (2024-11-24T09:48:03Z)
- Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA.
arXiv Detail & Related papers (2023-09-29T10:54:10Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z)
- Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification [16.757917001089762]
Few-shot classification aims to recognize unseen classes using very limited samples.
In this paper, we show that the two challenges can be well modeled simultaneously via a unified Query-Support TransFormer model.
Experiments on four popular datasets demonstrate the effectiveness and superiority of the proposed QSFormer.
arXiv Detail & Related papers (2022-08-26T01:53:23Z)
- Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval [21.05804942940532]
Cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query.
To cope with multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN).
Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features.
arXiv Detail & Related papers (2022-04-21T03:53:19Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation which presents great efficacy in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [126.4462739820643]
We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD).
The proposed network mainly solves two challenging issues: 1) how to effectively integrate the complementary information from RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features.
arXiv Detail & Related papers (2020-07-14T14:22:50Z)