MSAF: Multimodal Split Attention Fusion
- URL: http://arxiv.org/abs/2012.07175v1
- Date: Sun, 13 Dec 2020 22:42:41 GMT
- Title: MSAF: Multimodal Split Attention Fusion
- Authors: Lang Su, Chuqing Hu, Guofa Li, Dongpu Cao
- Abstract summary: We propose a novel multimodal fusion module that learns to emphasize more contributive features across all modalities. Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.
- Score: 6.460517449962825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning mimics the reasoning process of the human multi-sensory system, which is used to perceive the surrounding world. While making a prediction, the human brain tends to relate crucial cues from multiple sources of information. In this work, we propose a novel multimodal fusion module that learns to emphasize more contributive features across all modalities. Specifically, the proposed Multimodal Split Attention Fusion (MSAF) module splits each modality into channel-wise equal feature blocks and creates a joint representation that is used to generate soft attention for each channel across the feature blocks. Further, the MSAF module is designed to be compatible with features of various spatial dimensions and sequence lengths, suitable for both CNNs and RNNs. Thus, MSAF can be easily added to fuse features of any unimodal networks and utilize existing pretrained unimodal model weights. To demonstrate the effectiveness of our fusion module, we design three multimodal networks with MSAF for emotion recognition, sentiment analysis, and action recognition tasks. Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.
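As a rough illustration of the mechanism described in the abstract, the sketch below implements split attention over pooled modality features in PyTorch. It assumes every modality has already been pooled to a (batch, channels) vector of the same width; the class name `SplitAttentionFusionSketch`, the block count, the bottleneck reduction, and the sigmoid gating are illustrative assumptions rather than the authors' exact MSAF configuration.

```python
import torch
import torch.nn as nn


class SplitAttentionFusionSketch(nn.Module):
    """Illustrative split-attention fusion over pooled modality features.

    Each modality's channel vector is split into equal blocks, all blocks are
    summed into a joint representation, and a per-block channel-wise sigmoid
    gate derived from that joint representation re-weights each block.
    (A sketch of the idea, not the authors' exact MSAF module.)
    """

    def __init__(self, channels: int, num_modalities: int,
                 blocks_per_modality: int = 2, reduction: int = 4):
        super().__init__()
        assert channels % blocks_per_modality == 0
        self.block_size = channels // blocks_per_modality
        hidden = max(self.block_size // reduction, 8)
        total_blocks = num_modalities * blocks_per_modality
        # Shared bottleneck that squeezes the joint representation.
        self.squeeze = nn.Sequential(nn.Linear(self.block_size, hidden),
                                     nn.ReLU(inplace=True))
        # One excitation head per block produces its channel-wise attention.
        self.excite = nn.ModuleList(
            [nn.Linear(hidden, self.block_size) for _ in range(total_blocks)])

    def forward(self, feats):
        # feats: list of (batch, channels) tensors, one per modality
        # (e.g. globally pooled CNN maps or pooled RNN hidden states).
        blocks = [b for x in feats for b in torch.split(x, self.block_size, dim=1)]
        # Joint representation across all modalities and blocks.
        joint = self.squeeze(torch.stack(blocks, dim=0).sum(dim=0))
        # Soft channel attention for each block, applied back to that block.
        gated = [b * torch.sigmoid(head(joint)) for b, head in zip(blocks, self.excite)]
        # Re-assemble each modality from its gated blocks.
        n = len(blocks) // len(feats)
        return [torch.cat(gated[i * n:(i + 1) * n], dim=1) for i in range(len(feats))]
```

For example, fusing a visual and an audio branch that each output 128-channel vectors (two blocks per modality) returns two re-weighted 128-channel vectors that can be fed back into the unchanged unimodal heads, consistent with the plug-in use of pretrained unimodal weights described in the abstract.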
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation [8.443065903814821]
This study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation.
At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data.
This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data.
arXiv Detail & Related papers (2024-10-15T00:52:16Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on a unified vision-language model (VLM).
arXiv Detail & Related papers (2024-03-17T19:12:26Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- MMSFormer: Multimodal Transformer for Material and Semantic Segmentation [16.17270247327955]
We propose a novel fusion strategy that can effectively fuse information from different modality combinations.
We also propose a new model named Multi-Modal TransFormer (MMSFormer) that incorporates the proposed fusion strategy.
MMSFormer outperforms current state-of-the-art models on three different datasets.
arXiv Detail & Related papers (2023-09-07T20:07:57Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process (a minimal fixed-point sketch appears after this list).
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- Multi-modal land cover mapping of remote sensing images using pyramid attention and gated fusion networks [20.66034058363032]
We propose a new multi-modal network for land cover mapping of remote sensing data, built on a novel pyramid attention fusion (PAF) module and a gated fusion unit (GFU).
The PAF module is designed to efficiently obtain rich fine-grained contextual representations from each modality with a built-in cross-level and cross-view attention fusion mechanism.
The GFU module utilizes a novel gating mechanism for early merging of features, thereby diminishing hidden redundancies and noise.
arXiv Detail & Related papers (2021-11-06T10:01:01Z)
- Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
We verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
We also propose a bidirectional multi-layer fusion scheme in which multimodal features are exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z)
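Several entries above describe their fusion mechanisms only at a high level; as one concrete illustration, the sketch below mirrors the fixed-point formulation mentioned in the Deep Equilibrium Multimodal Fusion entry. It uses naive forward iteration with a weight-tied cell in PyTorch; the cell architecture, tolerance, and the class name `FixedPointFusionSketch` are assumptions for illustration and stand in for the root solvers and implicit differentiation used by actual DEQ models.

```python
import torch
import torch.nn as nn


class FixedPointFusionSketch(nn.Module):
    """Equilibrium-style fusion sketch: iterate a weight-tied cell
    z_{k+1} = f(z_k, x) until z stops changing, and use the (approximate)
    fixed point z* as the fused representation."""

    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        # Weight-tied fusion cell applied at every iteration.
        self.cell = nn.Sequential(
            nn.Linear(dim * (num_modalities + 1), dim),
            nn.Tanh(),
        )

    def forward(self, feats, max_iters: int = 30, tol: float = 1e-4):
        # feats: list of (batch, dim) modality features, injected at every step.
        x = torch.cat(feats, dim=1)
        z = torch.zeros_like(feats[0])
        for _ in range(max_iters):
            z_next = self.cell(torch.cat([z, x], dim=1))
            # Stop once the update is small relative to the current state.
            if torch.norm(z_next - z) < tol * (torch.norm(z) + 1e-6):
                z = z_next
                break
            z = z_next
        return z  # approximate fixed point, used as the fused feature
```

Real DEQ models avoid backpropagating through the iterations by differentiating implicitly at the fixed point; the loop here is only meant to show the forward computation.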