Multi-Modal Fusion Transformer for Visual Question Answering in Remote
Sensing
- URL: http://arxiv.org/abs/2210.04510v1
- Date: Mon, 10 Oct 2022 09:20:33 GMT
- Title: Multi-Modal Fusion Transformer for Visual Question Answering in Remote
Sensing
- Authors: Tim Siebert, Kai Norman Clasen, Mahdyar Ravanbakhsh, Begüm Demir
- Abstract summary: VQA allows a user to formulate a free-form question concerning the content of RS images to extract generic information.
Most of the current fusion approaches use modality-specific representations in their fusion modules instead of joint representation learning.
We propose a multi-modal transformer-based architecture to overcome this issue.
- Score: 1.491109220586182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the new generation of satellite technologies, the archives of remote
sensing (RS) images are growing very fast. To make the intrinsic information of
each RS image easily accessible, visual question answering (VQA) has been
introduced in RS. VQA allows a user to formulate a free-form question
concerning the content of RS images to extract generic information. It has been
shown that the fusion of the input modalities (i.e., image and text) is crucial
for the performance of VQA systems. Most of the current fusion approaches use
modality-specific representations in their fusion modules instead of joint
representation learning. However, to discover the underlying relation between
both the image and question modality, the model is required to learn the joint
representation instead of simply combining (e.g., concatenating, adding, or
multiplying) the modality-specific representations. We propose a multi-modal
transformer-based architecture to overcome this issue. Our proposed
architecture consists of three main modules: i) the feature extraction module
for extracting the modality-specific features; ii) the fusion module, which
leverages a user-defined number of multi-modal transformer layers of the
VisualBERT model (VB); and iii) the classification module to obtain the answer.
Experimental results obtained on the RSVQAxBEN and RSVQA-LR datasets (which are
made up of RGB bands of Sentinel-2 images) demonstrate the effectiveness of
VBFusion for VQA tasks in RS. To analyze the importance of using other spectral
bands for the description of the complex content of RS images in the framework
of VQA, we extend the RSVQAxBEN dataset to include all the spectral bands of
Sentinel-2 images with 10m and 20m spatial resolution.
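To make the three-module design concrete, below is a minimal PyTorch sketch of a VBFusion-style VQA pipeline: a ResNet-50 backbone and a word-embedding layer stand in for the feature extraction module, a generic nn.TransformerEncoder over the concatenated text and image token sequence stands in for the user-defined number of VisualBERT fusion layers, and an MLP head plays the role of the classification module. The backbone choice, all dimensions, and the mean-pooled answer head are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a VBFusion-style VQA model (illustrative assumptions only).
# The ResNet-50 backbone, dimensions, and the generic nn.TransformerEncoder
# (standing in for the VisualBERT fusion layers) are not the authors' exact setup.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class VBFusionSketch(nn.Module):
    def __init__(self, vocab_size, num_answers, d_model=768,
                 num_fusion_layers=4, max_question_len=32):
        super().__init__()
        # i) feature extraction module: modality-specific features
        backbone = resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # B x 2048 x 7 x 7
        self.img_proj = nn.Linear(2048, d_model)                    # image regions -> tokens
        self.txt_emb = nn.Embedding(vocab_size, d_model)            # question words -> tokens
        self.pos_emb = nn.Embedding(max_question_len, d_model)
        # ii) fusion module: joint representation over text + image tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_fusion_layers)
        # iii) classification module: predict the answer class
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_answers))

    def forward(self, image, question_ids):
        # image: B x 3 x 224 x 224, question_ids: B x L (token indices)
        img_feat = self.cnn(image)                                        # B x 2048 x 7 x 7
        img_tokens = self.img_proj(img_feat.flatten(2).transpose(1, 2))   # B x 49 x d_model
        pos = torch.arange(question_ids.size(1), device=question_ids.device)
        txt_tokens = self.txt_emb(question_ids) + self.pos_emb(pos)       # B x L x d_model
        joint = torch.cat([txt_tokens, img_tokens], dim=1)   # one joint token sequence
        fused = self.fusion(joint)                           # joint representation learning
        return self.classifier(fused.mean(dim=1))            # answer logits


# Usage sketch with random inputs (vocabulary and answer counts are arbitrary).
model = VBFusionSketch(vocab_size=10000, num_answers=9)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 16)))
```

The essential point of the sketch is that the fusion layers attend jointly over text and image tokens, learning a joint representation rather than merely concatenating, adding, or multiplying two modality-specific feature vectors.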
Related papers
- Fine-grained Late-interaction Multi-modal Retrieval for Retrieval
Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA.
arXiv Detail & Related papers (2023-09-29T10:54:10Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and
Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts (a toy sub-band decomposition sketch is given after this list).
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for
Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z)
- Few-Shot Learning Meets Transformer: Unified Query-Support Transformers
for Few-Shot Classification [16.757917001089762]
Few-shot classification aims to recognize unseen classes using very limited samples.
In this paper, we show that the main challenges of few-shot classification can be modeled simultaneously via a unified Query-Support TransFormer (QSFormer) model.
Experiments on four popular datasets demonstrate the effectiveness and superiority of the proposed QSFormer.
arXiv Detail & Related papers (2022-08-26T01:53:23Z)
- Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote
Sensing Image Retrieval [21.05804942940532]
Cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query.
To cope with the problems of multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN).
Our model adapts to multi-scale feature inputs, favors multi-source retrieval methods, and can dynamically filter redundant features.
arXiv Detail & Related papers (2022-04-21T03:53:19Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
The key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective for both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- How to find a good image-text embedding for remote sensing visual
question answering? [41.0510495281302]
Visual question answering (VQA) has been introduced to remote sensing to make information extraction from overhead imagery more accessible to everyone.
We study three different fusion methodologies in the context of VQA for remote sensing and analyse the gains in accuracy with respect to the model complexity.
arXiv Detail & Related papers (2021-09-24T09:48:28Z)
- RGB-D Salient Object Detection with Cross-Modality Modulation and
Selection [126.4462739820643]
We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD).
The proposed network mainly addresses two challenging issues: 1) how to effectively integrate the complementary information from the RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features.
arXiv Detail & Related papers (2020-07-14T14:22:50Z)
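The frequency decomposition mentioned in the UFAFormer entry above can be illustrated with a short PyWavelets snippet: a single-level 2D discrete wavelet transform splits an image into one low-frequency approximation band and three high-frequency detail bands, which a model can then attend over within and across bands. The Haar wavelet, the single decomposition level, and the random input are illustrative assumptions, not the UFAFormer configuration.

```python
# Toy sub-band decomposition with a 2D discrete wavelet transform (DWT).
# Wavelet family, decomposition level, and the random input are assumptions
# for illustration; they are not taken from the UFAFormer paper.
import numpy as np
import pywt

image = np.random.rand(256, 256)             # stand-in for a grayscale image
cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')  # approximation + 3 detail bands
# cA holds low-frequency content; cH, cV, cD hold horizontal, vertical, and
# diagonal high-frequency details, each of shape (128, 128). Intra-band and
# inter-band self-attention would operate within and across such sub-bands.
print(cA.shape, cH.shape, cV.shape, cD.shape)
```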