MMFormer: Multimodal Transformer Using Multiscale Self-Attention for
Remote Sensing Image Classification
- URL: http://arxiv.org/abs/2303.13101v1
- Date: Thu, 23 Mar 2023 08:34:24 GMT
- Authors: Bo Zhang, Zuheng Ming, Wei Feng, Yaqian Liu, Liang He, Kaixing Zhao
- Abstract summary: We introduce a new Multimodal Transformer (MMFormer) for Remote Sensing (RS) image classification using Hyperspectral Image (HSI) data together with another data source such as Light Detection and Ranging (LiDAR).
Unlike the traditional Vision Transformer (ViT), which lacks the inductive biases of convolutions, MMFormer first uses convolutional layers to tokenize patches from the multimodal HSI and LiDAR data.
The proposed MSMHSA module incorporates HSI into LiDAR data in a coarse-to-fine manner, enabling a fine-grained representation to be learned.
- Score: 16.031616431297213
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To benefit from the complementary information in heterogeneous data, we
introduce a new Multimodal Transformer (MMFormer) for Remote Sensing (RS) image
classification using Hyperspectral Image (HSI) data accompanied by another data
source such as Light Detection and Ranging (LiDAR). Unlike the traditional
Vision Transformer (ViT), which lacks the inductive biases of convolutions, we
first introduce convolutional layers into MMFormer to tokenize patches from the
multimodal HSI and LiDAR data. We then propose a Multi-scale Multi-head
Self-Attention (MSMHSA) module to address the compatibility problem that often
limits the fusion of HSI, with its high spectral resolution, and LiDAR, with its
relatively low spatial resolution. The proposed MSMHSA module incorporates HSI
into LiDAR data in a coarse-to-fine manner, enabling a fine-grained
representation to be learned. Extensive experiments on widely used benchmarks
(e.g., Trento and MUUFL) demonstrate the effectiveness and superiority of the
proposed MMFormer for RS image classification.
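For intuition, the following is a minimal PyTorch sketch of the pattern the abstract describes: convolutional tokenization of each modality, followed by multi-head self-attention over the joint HSI and LiDAR token sequence at several patch scales. All module names, channel counts, and patch sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    """Tokenize an image into patch embeddings with a strided convolution
    (adding the convolutional inductive bias the abstract mentions)."""
    def __init__(self, in_ch, dim, patch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, C, H, W)
        t = self.proj(x)                       # (B, dim, H/p, W/p)
        return t.flatten(2).transpose(1, 2)    # (B, N, dim)

class MultiScaleFusion(nn.Module):
    """Coarse-to-fine fusion: self-attention over joint HSI+LiDAR tokens
    at several patch scales, averaging the pooled per-scale features."""
    def __init__(self, hsi_ch, lidar_ch, dim=64, scales=(16, 8, 4), heads=4):
        super().__init__()
        self.hsi_tok = nn.ModuleList(ConvTokenizer(hsi_ch, dim, s) for s in scales)
        self.lid_tok = nn.ModuleList(ConvTokenizer(lidar_ch, dim, s) for s in scales)
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in scales)

    def forward(self, hsi, lidar):
        feats = []
        for ht, lt, attn in zip(self.hsi_tok, self.lid_tok, self.attn):
            tokens = torch.cat([ht(hsi), lt(lidar)], dim=1)  # joint sequence
            fused, _ = attn(tokens, tokens, tokens)          # self-attention
            feats.append(fused.mean(dim=1))                  # pool per scale
        return torch.stack(feats).mean(0)                    # (B, dim)

# Shapes only; a 144-band HSI patch and 1-band LiDAR patch are assumptions.
model = MultiScaleFusion(hsi_ch=144, lidar_ch=1)
out = model(torch.randn(2, 144, 64, 64), torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 64])
```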
Related papers
- A Sinkhorn Regularized Adversarial Network for Image Guided DEM Super-resolution using Frequency Selective Hybrid Graph Transformer [4.383449961857098]
The Digital Elevation Model (DEM) is essential in the remote sensing (RS) domain for analyzing various applications related to surface elevation.
Here, we address the generation of high-resolution (HR) DEMs using HR multi-spectral (MX) satellite imagery as a guide.
We present a novel adversarial objective based on optimizing the Sinkhorn distance within a classical GAN.
arXiv Detail & Related papers (2024-09-21T16:59:08Z)
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning deep features from medical images of different modalities.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then use a first adapter to adjust these tokens and LoRA in pre-trained LLMs to fine-tune their weights.
Third, to align the tokens, we utilize four further adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
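For background on the LoRA fine-tuning mentioned above, here is a minimal, generic LoRA layer in PyTorch; the rank, scaling, and frozen-base-layer setup are standard LoRA practice and are not details taken from LLM-Morph itself.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze a pre-trained linear layer and learn a low-rank update:
    y = W x + (alpha / r) * B(A(x)); only A and B are trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # frozen pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=0.01)
        nn.init.zeros_(self.B.weight)         # the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(nn.Linear(768, 768))       # 768 is an assumed width
y = layer(torch.randn(4, 768))                # (4, 768)
```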
- GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization [21.846935203845728]
A local manipulation pipeline is designed, incorporating the powerful SAM, ChatGPT, and generative models.
The GIM dataset has the following advantages: 1) Large scale, including over one million pairs of AI-manipulated images and real images.
We propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) Module.
arXiv Detail & Related papers (2024-06-24T11:10:41Z)
- Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution [49.902047563260496]
We make the first attempt to integrate the Vision State Space Model (Mamba) into remote sensing image (RSI) super-resolution.
To achieve better SR reconstruction, we build upon Mamba to devise a Frequency-assisted Mamba framework, dubbed FMSR.
Our FMSR features a multi-level fusion architecture equipped with a Frequency Selection Module (FSM), a Vision State Space Module (VSSM), and a Hybrid Gate Module (HGM).
arXiv Detail & Related papers (2024-05-08T11:09:24Z)
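The snippet above only names the FSM; as a generic illustration of frequency selection on feature maps, the sketch below masks a radial frequency band with torch.fft. The cutoff and masking scheme are assumptions for illustration, not FMSR's actual design.

```python
import torch

def frequency_select(x, keep="high", cutoff=0.25):
    """Split a feature map (B, C, H, W) in the frequency domain and
    keep only the chosen band, via a centered FFT mask."""
    B, C, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    # Radial mask: True inside the low-frequency disc around the center.
    yy, xx = torch.meshgrid(
        torch.arange(H, device=x.device), torch.arange(W, device=x.device),
        indexing="ij")
    dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    low = dist <= cutoff * min(H, W) / 2
    mask = low if keep == "low" else ~low
    return torch.fft.ifft2(
        torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real

feats = torch.randn(2, 64, 32, 32)
high = frequency_select(feats, keep="high")  # high-frequency detail component
```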
- SDR-Former: A Siamese Dual-Resolution Transformer for Liver Lesion Classification Using 3D Multi-Phase Imaging [59.78761085714715]
This study proposes a novel Siamese Dual-Resolution Transformer (SDR-Former) framework for liver lesion classification.
The proposed framework has been validated through comprehensive experiments on two clinical datasets.
To support the scientific community, we are releasing our extensive multi-phase MR dataset for liver lesion analysis to the public.
arXiv Detail & Related papers (2024-02-27T06:32:56Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
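The wavelet decomposition described above can be reproduced with a standard DWT library; the sketch below uses PyWavelets to split an image into one low-frequency and three high-frequency sub-bands. The Haar wavelet and single decomposition level are assumptions for illustration.

```python
import numpy as np
import pywt

# One-level 2-D DWT: cA is the low-frequency approximation, and
# (cH, cV, cD) hold horizontal/vertical/diagonal high-frequency detail.
image = np.random.rand(256, 256).astype(np.float32)  # stand-in grayscale face
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")

print(cA.shape, cH.shape)  # each sub-band is 128x128
# A frequency encoder could then attend within each sub-band (intra-band)
# and across the four sub-bands (inter-band), as the abstract describes.
```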
- TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis [6.899641625551976]
We introduce a pure transformer-based method, referred to as the "Throughout Fusion Transformer (TFormer)", for sufficient information integration in MSLD.
We then carefully design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks to fuse information across different image modalities in a stage-by-stage manner.
Our TFormer outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2022-11-21T12:07:05Z)
- Decoupled-and-Coupled Networks: Self-Supervised Hyperspectral Image Super-Resolution with Subpixel Fusion [67.35540259040806]
We propose a subpixel-level HS super-resolution framework built on a novel decoupled-and-coupled network, called DC-Net.
As the name suggests, DC-Net first decouples the input into common (or cross-sensor) and sensor-specific components.
We append a self-supervised learning module behind the CSU net that guarantees material consistency, enhancing the detailed appearance of the restored HS product.
arXiv Detail & Related papers (2022-05-07T23:40:36Z)
- Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction [138.04956118993934]
We propose a novel Transformer-based method, the coarse-to-fine sparse Transformer (CST), which embeds HSI sparsity into deep learning for HSI reconstruction.
In particular, CST uses our proposed spectra-aware screening mechanism (SASM) for coarse patch selection. The selected patches are then fed into our customized spectra-aggregation hashing multi-head self-attention (SAH-MSA) for fine pixel clustering and self-similarity capture.
arXiv Detail & Related papers (2022-03-09T16:17:47Z)
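Coarse patch selection of this kind can be sketched as scoring patches and keeping the top k; since the snippet does not detail SASM, the scorer below (mean absolute response) is a stand-in.

```python
import torch

def select_patches(hsi, patch=8, k=16):
    """Coarse screening: split an HSI cube (C, H, W) into non-overlapping
    patches, score each, and return the k highest-scoring patches."""
    C, H, W = hsi.shape
    # (C, H/p, p, W/p, p) -> (H/p * W/p, C, p, p): one row per patch
    patches = (hsi.reshape(C, H // patch, patch, W // patch, patch)
                  .permute(1, 3, 0, 2, 4)
                  .reshape(-1, C, patch, patch))
    scores = patches.abs().mean(dim=(1, 2, 3))  # stand-in "spectra-aware" score
    top = torch.topk(scores, k).indices
    return patches[top]                          # (k, C, patch, patch)

selected = select_patches(torch.randn(28, 64, 64))  # 28 bands is an assumption
print(selected.shape)  # torch.Size([16, 28, 8, 8])
```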
- Boosting Image Super-Resolution Via Fusion of Complementary Information Captured by Multi-Modal Sensors [21.264746234523678]
Image Super-Resolution (SR) provides a promising technique to enhance the image quality of low-resolution optical sensors.
In this paper, we attempt to leverage complementary information from a low-cost channel (visible/depth) to boost the image quality of an expensive channel (thermal) using fewer parameters.
arXiv Detail & Related papers (2020-12-07T02:15:28Z)