Fus-MAE: A cross-attention-based data fusion approach for Masked
Autoencoders in remote sensing
- URL: http://arxiv.org/abs/2401.02764v1
- Date: Fri, 5 Jan 2024 11:36:21 GMT
- Title: Fus-MAE: A cross-attention-based data fusion approach for Masked
Autoencoders in remote sensing
- Authors: Hugo Chan-To-Hing, Bharadwaj Veeravalli
- Abstract summary: Fus-MAE is a self-supervised learning framework based on masked autoencoders.
Our empirical findings demonstrate that Fus-MAE can effectively compete with contrastive learning strategies tailored for SAR-optical data fusion.
- Score: 5.990692497580643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised frameworks for representation learning have recently stirred
up interest among the remote sensing community, given their potential to
mitigate the high labeling costs associated with curating large satellite image
datasets. In the realm of multimodal data fusion, while the commonly used
contrastive learning methods can help bridge the domain gap between different
sensor types, they rely on data augmentation techniques that require expertise
and careful design, especially for multispectral remote sensing data. A
possible but rather scarcely studied way to circumvent these limitations is to
use a masked image modelling based pretraining strategy. In this paper, we
introduce Fus-MAE, a self-supervised learning framework based on masked
autoencoders that uses cross-attention to perform early and feature-level data
fusion between synthetic aperture radar and multispectral optical data - two
modalities with a significant domain gap. Our empirical findings demonstrate
that Fus-MAE can effectively compete with contrastive learning strategies
tailored for SAR-optical data fusion and outperforms other masked-autoencoder
frameworks trained on a larger corpus.
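The abstract describes the fusion mechanism only at a high level; the snippet below is a minimal, hypothetical sketch (not the official Fus-MAE implementation) of how cross-attention could fuse SAR patch tokens into multispectral optical patch tokens inside a MAE-style encoder. The module name, dimensions, and the use of PyTorch's standard multi-head attention layer are assumptions made for illustration.
```python
# Illustrative sketch only: cross-attention fusion of SAR tokens into optical
# tokens, as one might place inside a masked-autoencoder encoder. Shapes and
# hyperparameters are placeholders, not taken from the paper.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse SAR patch tokens into optical patch tokens via cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, opt_tokens: torch.Tensor, sar_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the optical stream, keys/values from the SAR stream,
        # so each visible optical patch attends to all visible SAR patches.
        q = self.norm_q(opt_tokens)
        kv = self.norm_kv(sar_tokens)
        fused, _ = self.attn(q, kv, kv, need_weights=False)
        # Residual connection keeps the original optical representation intact.
        return opt_tokens + fused


if __name__ == "__main__":
    # Toy shapes: batch of 2, 49 visible patches per modality, 768-d embeddings,
    # roughly what a ViT-B/MAE-style encoder sees after random masking.
    opt = torch.randn(2, 49, 768)
    sar = torch.randn(2, 49, 768)
    print(CrossAttentionFusion()(opt, sar).shape)  # torch.Size([2, 49, 768])
```
In an early-fusion variant such a block would sit at the first encoder layer, operating on the visible (unmasked) tokens of both modalities; a feature-level variant would insert it deeper in the encoder stack.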
Related papers
- MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning [3.520960737058199]
We introduce Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE).
Mu-MAE integrates a multimodal masked autoencoder with a synchronized masking strategy tailored for wearable sensors.
It achieves up to 80.17% accuracy on five-way one-shot multimodal classification without the use of additional data.
arXiv Detail & Related papers (2024-08-08T06:16:00Z) - MaeFuse: Transferring Omni Features with Pretrained Masked Autoencoders for Infrared and Visible Image Fusion via Guided Training [57.18758272617101]
MaeFuse is a novel autoencoder model designed for infrared and visible image fusion (IVIF).
Our model utilizes a pretrained encoder from Masked Autoencoders (MAE), which facilitates omni feature extraction for low-level reconstruction and high-level vision tasks.
MaeFuse not only introduces a novel perspective in the realm of fusion techniques but also stands out with impressive performance across various public datasets.
arXiv Detail & Related papers (2024-04-17T02:47:39Z) - Towards Efficient Information Fusion: Concentric Dual Fusion Attention Based Multiple Instance Learning for Whole Slide Images [2.428210413498989]
We introduce the Concentric Dual Fusion Attention-MIL (CDFA-MIL) framework.
CDFA-MIL combines point-to-area feature-column attention and point-to-point concentric-row attention using concentric patches.
Its application has demonstrated exceptional performance, significantly surpassing existing MIL methods in accuracy and F1 scores on prominent datasets.
arXiv Detail & Related papers (2024-03-21T12:23:29Z) - Efficient Multi-Resolution Fusion for Remote Sensing Data with Label
Uncertainty [0.7832189413179361]
This paper presents a new method for fusing multi-modal and multi-resolution remote sensor data without requiring pixel-level training labels.
We propose a new method based on binary fuzzy measures, which reduces the search space and significantly improves the efficiency of the MIMRF framework.
arXiv Detail & Related papers (2024-02-07T17:34:32Z) - Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images [1.662438436885552]
Multi-modal fusion has been shown to enhance accuracy by combining data from multiple modalities.
We propose a novel multi-modal fusion strategy for mapping relationships between different channels at the early stage.
By addressing fusion in the early stage, as opposed to mid- or late-stage methods, our method achieves competitive and even superior performance compared to existing techniques.
arXiv Detail & Related papers (2023-10-21T00:56:11Z) - Enhancing Cross-Dataset Performance of Distracted Driving Detection With
Score-Softmax Classifier [7.302402275736439]
Deep neural networks enable real-time monitoring of in-vehicle drivers, facilitating the timely prediction of distractions, fatigue, and potential hazards.
Recent research has exposed unreliable cross-dataset end-to-end driver behavior recognition due to overfitting.
We introduce the Score-Softmax classifier, which addresses this issue by enhancing inter-class independence and intra-class uncertainty.
arXiv Detail & Related papers (2023-10-08T15:28:01Z) - Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement
Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z) - Cluster-level pseudo-labelling for source-free cross-domain facial
expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - Target-aware Dual Adversarial Learning and a Multi-scenario
Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection [65.30079184700755]
This study addresses the issue of fusing infrared and visible images that appear differently for object detection.
Previous approaches discover commonalities underlying the two modalities and perform fusion in the common space either by iterative optimization or deep networks.
This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, and then unrolls to a target-aware Dual Adversarial Learning (TarDAL) network for fusion and a commonly used detection network.
arXiv Detail & Related papers (2022-03-30T11:44:56Z) - MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake
Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos.
We present a novel approach, termed as MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
arXiv Detail & Related papers (2021-09-15T14:11:53Z) - X-ModalNet: A Semi-Supervised Deep Cross-Modal Network for
Classification of Remote Sensing Data [69.37597254841052]
We propose a novel cross-modal deep-learning framework called X-ModalNet.
X-ModalNet generalizes well, owing to propagating labels on an updatable graph constructed by high-level features on the top of the network.
We evaluate X-ModalNet on two multi-modal remote sensing datasets (HSI-MSI and HSI-SAR) and achieve a significant improvement in comparison with several state-of-the-art methods.
arXiv Detail & Related papers (2020-06-24T15:29:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.