CoMiX: Cross-Modal Fusion with Deformable Convolutions for HSI-X Semantic Segmentation
- URL: http://arxiv.org/abs/2411.09023v1
- Date: Wed, 13 Nov 2024 21:00:28 GMT
- Title: CoMiX: Cross-Modal Fusion with Deformable Convolutions for HSI-X Semantic Segmentation
- Authors: Xuming Zhang, Xingfa Gu, Qingjiu Tian, Lorenzo Bruzzone
- Abstract summary: CoMiX is an asymmetric encoder-decoder architecture with deformable convolutions (DCNs) for HSI-X semantic segmentation.
CoMiX is designed to extract, calibrate, and fuse information from HSI and X data.
- Score: 10.26122715098048
- Abstract: Improving hyperspectral image (HSI) semantic segmentation by exploiting complementary information from a supplementary data type (referred to as the X-modality) is promising but challenging due to differences in imaging sensors, image content, and resolution. Current techniques struggle to enhance modality-specific and modality-shared information, as well as to capture dynamic interaction and fusion between different modalities. In response, this study proposes CoMiX, an asymmetric encoder-decoder architecture with deformable convolutions (DCNs) for HSI-X semantic segmentation. CoMiX is designed to extract, calibrate, and fuse information from HSI and X data. Its pipeline includes an encoder with two parallel and interacting backbones and a lightweight all-multilayer perceptron (ALL-MLP) decoder. The encoder consists of four stages, each incorporating 2D DCN blocks for the X modality to accommodate geometric variations and 3D DCN blocks for HSIs to adaptively aggregate spatial-spectral features. Additionally, each stage includes a Cross-Modality Feature enhancement and eXchange (CMFeX) module and a feature fusion module (FFM). CMFeX is designed to exploit spatial-spectral correlations from different modalities to recalibrate and enhance modality-specific and modality-shared features while adaptively exchanging complementary information between them. Outputs from CMFeX are fed into the FFM for fusion and passed to the next stage for further information learning. Finally, the outputs from each FFM are integrated by the ALL-MLP decoder for final prediction. Extensive experiments demonstrate that our CoMiX achieves superior performance and generalizes well to various multimodal recognition tasks. The CoMiX code will be released.
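The stage structure described above is concrete enough to sketch. Below is a minimal, illustrative PyTorch reconstruction of one encoder stage, written from the abstract alone: the module and variable names, channel widths, the plain-Conv3d stand-in for the 3D DCN blocks, and the sigmoid-gated exchange inside CMFeX are all assumptions, not the authors' released implementation.

```python
# Illustrative sketch of one CoMiX-style encoder stage, reconstructed from
# the abstract only. Module names, channel widths, the Conv3d stand-in for
# the 3D DCN blocks, and the gated exchange are assumptions, not the
# authors' code.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class XBranchDCNBlock(nn.Module):
    """2D deformable-convolution block for the X-modality branch."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Offsets (2 coordinates per kernel tap) are predicted from the
        # input itself, as is standard for deformable convolutions.
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                kernel_size, padding=pad)
        self.dcn = DeformConv2d(channels, channels, kernel_size, padding=pad)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.dcn(x, self.offset(x)))


class HSIBranchBlock(nn.Module):
    """Spatial-spectral block for the HSI branch. The paper uses 3D DCNs;
    torchvision has no 3D deformable conv, so a plain Conv3d over the
    (spectral, height, width) volume stands in here."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(1, 1, kernel_size=3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):  # x: (B, C, H, W), C = spectral feature depth
        return self.act(self.conv(x.unsqueeze(1))).squeeze(1)


class CMFeX(nn.Module):
    """Cross-Modality Feature enhancement and eXchange: recalibrate each
    branch with a channel gate, then trade complementary features."""
    def __init__(self, channels: int):
        super().__init__()
        def gate():
            return nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(channels, channels, 1),
                                 nn.Sigmoid())
        self.gate_hsi, self.gate_x = gate(), gate()

    def forward(self, f_hsi, f_x):
        g_hsi, g_x = self.gate_hsi(f_hsi), self.gate_x(f_x)
        # One plausible reading of "adaptively exchanging complementary
        # information": each branch keeps what its own gate emphasizes
        # and receives what the other branch's gate suppresses.
        out_hsi = g_hsi * f_hsi + (1 - g_x) * f_x
        out_x = g_x * f_x + (1 - g_hsi) * f_hsi
        return out_hsi, out_x


class FFM(nn.Module):
    """Feature fusion module: concatenate both branches, project back."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_hsi, f_x):
        return self.proj(torch.cat([f_hsi, f_x], dim=1))


# Toy forward pass through one stage (random weights, shape check only).
f_hsi = torch.randn(2, 64, 32, 32)   # HSI-branch features
f_x = torch.randn(2, 64, 32, 32)     # X-modality-branch features
f_hsi, f_x = HSIBranchBlock()(f_hsi), XBranchDCNBlock(64)(f_x)
f_hsi, f_x = CMFeX(64)(f_hsi, f_x)
fused = FFM(64)(f_hsi, f_x)          # would feed the ALL-MLP decoder
print(fused.shape)                   # torch.Size([2, 64, 32, 32])
```

Per the abstract, the full model repeats this stage four times (presumably with downsampling in between), and the four FFM outputs are aggregated by the ALL-MLP decoder for the final prediction.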
Related papers
- CDXFormer: Boosting Remote Sensing Change Detection with Extended Long Short-Term Memory [3.119836924407993]
We propose CDXFormer, whose core component is a powerful XLSTM-based spatial enhancement layer.
We introduce a scale-specific Feature Enhancer layer that incorporates a Cross-Temporal Global Perceptron customized for semantically accurate deep features.
We also propose a Cross-Scale Interactive Fusion module to progressively combine global change representations with responses.
arXiv Detail & Related papers (2024-11-12T15:22:14Z)
- FIF-UNet: An Efficient UNet Using Feature Interaction and Fusion for Medical Image Segmentation [5.510679875888542]
A novel U-shaped model, called FIF-UNet, is proposed to address the above issue, including three plug-and-play modules.
Experiments on the Synapse and ACDC datasets demonstrate that the proposed FIF-UNet outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-09-09T04:34:47Z)
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
- Local-to-Global Cross-Modal Attention-Aware Fusion for HSI-X Semantic Segmentation [19.461033552684576]
We propose a Local-to-Global Cross-modal Attention-aware Fusion (LoGoCAF) framework for HSI-X classification.
LoGoCAF adopts a pixel-to-pixel two-branch semantic segmentation architecture to learn information from HSI and X modalities.
arXiv Detail & Related papers (2024-06-25T16:12:20Z)
- P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation [8.46409964236009]
Diffusion models and multi-scale features are essential components in semantic segmentation tasks.
We propose a new model for semantic segmentation: a diffusion model with parallel multi-scale branches.
Our model demonstrates superior performance based on the J1 metric on both the UAVid and Vaihingen Building datasets.
arXiv Detail & Related papers (2024-05-30T19:40:08Z)
- Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
- Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Fusing information across modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms state-of-the-art methods, improving mAP by 5.9% on the M3FD dataset and 4.9% on the FLIR-Aligned dataset.
arXiv Detail & Related papers (2024-04-14T05:28:46Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach that mines cross-modal semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Xformer: Hybrid X-Shaped Transformer for Image Denoising [114.37510775636811]
We present a hybrid X-shaped vision Transformer, named Xformer, which performs notably well on image denoising tasks.
Xformer achieves state-of-the-art performance on the synthetic and real-world image denoising tasks.
arXiv Detail & Related papers (2023-03-11T16:32:09Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective for both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Two Headed Dragons: Multimodal Fusion and Cross Modal Transactions [14.700807572189412]
We propose a novel transformer-based fusion method for HSI and LiDAR modalities.
The model is composed of stacked autoencoders that harness cross key-value pairs for HSI and LiDAR (sketched after this list).
We test our model on Houston (Data Fusion Contest - 2013) and MUUFL Gulfport datasets and achieve competitive results.
arXiv Detail & Related papers (2021-07-24T11:33:37Z)
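The "cross key-value pairs" in the last entry describe cross-attention: one modality supplies the queries while the other supplies the keys and values. A minimal sketch follows; the module name, token count, and dimensions are illustrative assumptions, not taken from the paper.

```python
# Minimal cross-attention sketch for HSI/LiDAR token exchange. The module
# name and all dimensions here are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Queries from one modality attend over keys/values from the other."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens_q, tokens_kv):
        out, _ = self.attn(query=tokens_q, key=tokens_kv, value=tokens_kv)
        return out


# HSI tokens query LiDAR tokens; a symmetric branch would do the reverse.
hsi = torch.randn(2, 196, 64)     # (batch, tokens, embedding dim)
lidar = torch.randn(2, 196, 64)
fused = CrossModalAttention(64)(hsi, lidar)
print(fused.shape)                # torch.Size([2, 196, 64])
```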
This list is automatically generated from the titles and abstracts of the papers on this site.