Multimodal Informative ViT: Information Aggregation and Distribution for
Hyperspectral and LiDAR Classification
- URL: http://arxiv.org/abs/2401.03179v2
- Date: Tue, 23 Jan 2024 05:57:30 GMT
- Title: Multimodal Informative ViT: Information Aggregation and Distribution for
Hyperspectral and LiDAR Classification
- Authors: Jiaqing Zhang, Jie Lei, Weiying Xie, Geng Yang, Daixun Li, Yunsong Li
- Abstract summary: Multimodal Informative ViT (MIViT) is a system with an innovative information aggregate-distributing mechanism.
MIViT reduces redundancy in the empirical distribution of each modality's separate and fused features.
Our results show that MIViT's bidirectional aggregate-distributing mechanism is highly effective.
- Score: 25.254816993934746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In multimodal land cover classification (MLCC), a common challenge is the
redundancy in data distribution, where irrelevant information from multiple
modalities can hinder the effective integration of their unique features. To
tackle this, we introduce the Multimodal Informative ViT (MIViT), a system with
an innovative information aggregate-distributing mechanism. This approach
redefines redundancy levels and integrates performance-aware elements into the
fused representation, facilitating the learning of semantics in both forward
and backward directions. MIViT stands out by significantly reducing redundancy
in the empirical distribution of each modality's separate and fused features.
It employs oriented attention fusion (OAF) for extracting shallow local
features across modalities in horizontal and vertical dimensions, and a
Transformer feature extractor for extracting deep global features through
long-range attention. We also propose an information aggregation constraint
(IAC) based on mutual information, designed to remove redundant information and
preserve complementary information within embedded features. Additionally, the
information distribution flow (IDF) in MIViT enhances performance-awareness by
distributing global classification information across different modalities'
feature maps. This architecture also addresses missing modality challenges with
lightweight independent modality classifiers, reducing the computational load
typically associated with Transformers. Our results show that MIViT's
bidirectional aggregate-distributing mechanism between modalities is highly
effective, achieving an average overall accuracy of 95.56% across three
multimodal datasets. This performance surpasses current state-of-the-art
methods in MLCC. The code for MIViT is accessible at
https://github.com/icey-zhang/MIViT.
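The aggregate-distribute pipeline described above (oriented attention fusion of shallow cross-modal features, a mutual-information-based aggregation constraint, and a distribution flow that feeds the fused prediction back into lightweight per-modality classifiers) can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch with assumed module names and shapes, not the authors' implementation (see the linked repository for that); the InfoNCE term is only a common stand-in for the mutual-information constraint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OrientedAttentionFusion(nn.Module):
    """Illustrative stand-in for OAF: fuses shallow HSI and LiDAR feature maps
    with horizontal and vertical (strip-pooled) attention."""

    def __init__(self, channels):
        super().__init__()
        self.h_pool = nn.AdaptiveAvgPool2d((None, 1))  # pool along width  -> horizontal context
        self.v_pool = nn.AdaptiveAvgPool2d((1, None))  # pool along height -> vertical context
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, hsi_feat, lidar_feat):
        x = torch.cat([hsi_feat, lidar_feat], dim=1)           # (B, 2C, H, W)
        attn = torch.sigmoid(self.h_pool(x) + self.v_pool(x))  # oriented attention map
        return self.proj(x * attn)                             # (B, C, H, W)


def aggregation_constraint(z_hsi, z_lidar, z_fused, temperature=0.1):
    """InfoNCE-style proxy for a mutual-information aggregation constraint: each
    modality embedding is pulled toward the fused embedding of the same sample
    (keeping complementary content) while other samples act as negatives."""

    def info_nce(z, z_ref):
        z, z_ref = F.normalize(z, dim=-1), F.normalize(z_ref, dim=-1)
        logits = z @ z_ref.t() / temperature                   # (B, B) similarities
        targets = torch.arange(z.size(0), device=z.device)
        return F.cross_entropy(logits, targets)

    return info_nce(z_hsi, z_fused) + info_nce(z_lidar, z_fused)


class LightweightModalityClassifier(nn.Module):
    """Per-modality head that can still predict on its own if the other modality is missing."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, z):
        return self.head(z)


def distribute_logits(fused_logits, modality_logits_list):
    """Information-distribution step: distill the fused (global) prediction back
    into each modality's classifier so every branch becomes performance-aware."""
    soft_target = F.softmax(fused_logits.detach(), dim=-1)
    return sum(
        F.kl_div(F.log_softmax(logits, dim=-1), soft_target, reduction="batchmean")
        for logits in modality_logits_list
    )
```

In a design of this kind, inference with a missing modality would fall back on the corresponding per-modality head alone, which is how the abstract motivates the lightweight independent classifiers.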
Related papers
- WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification [8.88666439137662]
We introduce the Wide-Ranging Information Mining Network (WRIM-Net), which mainly comprises a Multi-dimension Interactive Information Mining (MIIM) module and an Auxiliary-Information-based Contrastive Learning (AICL) approach.
Thanks to its low-computational-complexity design, MIIM can be positioned separately in shallow layers, enabling the network to better mine modality-specific multi-dimension information.
We conduct extensive experiments not only on the well-known SYSU-MM01 and RegDB datasets but also on the latest large-scale cross-modality LLCM dataset.
arXiv Detail & Related papers (2024-08-20T08:06:16Z)
- Modality Prompts for Arbitrary Modality Salient Object Detection [57.610000247519196]
This paper delves into the task of arbitrary modality salient object detection (AM SOD).
It aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images.
A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD.
arXiv Detail & Related papers (2024-05-06T11:02:02Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework, NativE, to achieve multi-modal knowledge graph completion (MMKGC) in the wild.
NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
- Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition [35.15390769958969]
We propose a unified framework, the Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR).
MCT-HFR consists of a novel attention-based encoder which concurrently extracts and dynamically balances the intra- and inter-modality relations.
During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise complete and incomplete representations.
arXiv Detail & Related papers (2023-12-26T01:59:23Z)
- Multi-scale Semantic Correlation Mining for Visible-Infrared Person Re-Identification [19.49945790485511]
MSCMNet is proposed to comprehensively exploit semantic features at multiple scales.
It simultaneously keeps modality information loss in feature extraction as small as possible.
Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets demonstrate that the proposed MSCMNet achieves the highest accuracy.
arXiv Detail & Related papers (2023-11-24T10:23:57Z)
- Factorized Contrastive Learning: Going Beyond Multi-view Redundancy [116.25342513407173]
This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy.
On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-06-08T15:17:04Z)
- Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations [27.855467591358018]
We introduce the multimodal information bottleneck (MIB), aiming to learn a powerful and sufficient multimodal representation.
We develop three MIB variants, namely, early-fusion MIB, late-fusion MIB, and complete MIB, to focus on different perspectives of information constraints.
Experimental results suggest that the proposed method reaches state-of-the-art performance on the tasks of multimodal sentiment analysis and multimodal emotion recognition.
arXiv Detail & Related papers (2022-10-31T16:14:18Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
Text-image person re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- Multi-modal land cover mapping of remote sensing images using pyramid attention and gated fusion networks [20.66034058363032]
We propose a new multi-modality network for land cover mapping of multi-modal remote sensing data based on a novel pyramid attention fusion (PAF) module and a gated fusion unit (GFU).
The PAF module is designed to efficiently obtain rich fine-grained contextual representations from each modality with a built-in cross-level and cross-view attention fusion mechanism.
The GFU module utilizes a novel gating mechanism for early merging of features, thereby diminishing hidden redundancies and noise.
arXiv Detail & Related papers (2021-11-06T10:01:01Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
The Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Deep Multimodal Fusion by Channel Exchanging [87.40768169300898]
This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities (a minimal sketch of this exchanging idea follows after this list).
The validity of such an exchanging process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an add-on benefit, allows the multimodal architecture to be almost as compact as a unimodal network.
arXiv Detail & Related papers (2020-11-10T09:53:20Z)
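As referenced in the channel-exchanging entry above, the basic idea (one convolution shared across modalities, modality-specific BatchNorm layers, and replacement of channels whose BN scaling factors shrink toward zero by the other modality's channels) can be illustrated with a small, hypothetical sketch. The module name and threshold below are assumptions for illustration, not the paper's actual code.

```python
import torch
import torch.nn as nn


class ChannelExchangeBlock(nn.Module):
    """Hypothetical two-modality block: a shared convolution, separate BatchNorm per
    modality, and exchange of channels whose BN scaling factor is below a threshold."""

    def __init__(self, channels, threshold=1e-2):
        super().__init__()
        self.shared_conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.ModuleList([nn.BatchNorm2d(channels) for _ in range(2)])
        self.threshold = threshold

    def forward(self, x0, x1):
        y0 = self.bn[0](self.shared_conv(x0))
        y1 = self.bn[1](self.shared_conv(x1))
        # Channels with a near-zero BN scale carry little modality-specific signal,
        # so they are replaced by the other modality's corresponding channels.
        mask0 = (self.bn[0].weight.abs() < self.threshold).view(1, -1, 1, 1)
        mask1 = (self.bn[1].weight.abs() < self.threshold).view(1, -1, 1, 1)
        return torch.where(mask0, y1, y0), torch.where(mask1, y0, y1)
```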