SFusion: Self-attention based N-to-One Multimodal Fusion Block
- URL: http://arxiv.org/abs/2208.12776v2
- Date: Tue, 4 Jul 2023 14:50:31 GMT
- Title: SFusion: Self-attention based N-to-One Multimodal Fusion Block
- Authors: Zecheng Liu and Jia Wei and Rui Li and Jianlong Zhou
- Abstract summary: We propose a self-attention based fusion block called SFusion.
It learns to fuse available modalities without synthesizing or zero-padding missing ones.
In this work, we apply SFusion to different backbone networks for human activity recognition and brain tumor segmentation tasks.
- Score: 6.059397373352718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People perceive the world with different senses, such as sight, hearing,
smell, and touch. Processing and fusing information from multiple modalities
enables Artificial Intelligence to understand the world around us more easily.
However, when modalities are missing, the number of available modalities
varies across situations, which leads to an N-to-One fusion problem.
To solve this problem, we propose a self-attention based fusion block called
SFusion. Different from preset formulations or convolution based methods, the
proposed block automatically learns to fuse available modalities without
synthesizing or zero-padding missing ones. Specifically, the feature
representations extracted from the upstream processing model are projected as
tokens and fed into a self-attention module to generate latent multimodal
correlations. Then, a modal attention mechanism is introduced to build a shared
representation, which can be used by the downstream decision model. The
proposed SFusion can be easily integrated into existing multimodal analysis
networks. In this work, we apply SFusion to different backbone networks for
human activity recognition and brain tumor segmentation tasks. Extensive
experimental results show that the SFusion block achieves better performance
than the competing fusion strategies. Our code is available at
https://github.com/scut-cszcl/SFusion.
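Since the abstract describes the block only at a high level, the sketch below illustrates the general idea in PyTorch: whichever modality features are available are stacked as tokens, passed through self-attention, and pooled by a learned modal-attention score into a single shared representation. The class name, dimensions, and the softmax scoring head are assumptions made for illustration, not the authors' implementation (see the linked repository for the official code).

```python
# Minimal sketch of an N-to-one fusion step in the spirit of SFusion.
# Written from the abstract alone; names and the pooling step are assumptions.
import torch
import torch.nn as nn

class NToOneFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Self-attention over modality tokens captures latent cross-modal correlations.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A simple "modal attention" head: score each modality token, then
        # take a weighted sum to build one shared representation.
        self.score = nn.Linear(dim, 1)

    def forward(self, modality_feats: list) -> torch.Tensor:
        # modality_feats: list of (B, dim) features from the modalities that are
        # present; missing modalities are simply absent from the list, so no
        # synthesis or zero-padding is required.
        tokens = torch.stack(modality_feats, dim=1)           # (B, N, dim), N varies per call
        attended, _ = self.self_attn(tokens, tokens, tokens)  # latent multimodal correlations
        weights = torch.softmax(self.score(attended), dim=1)  # (B, N, 1) modal attention weights
        shared = (weights * attended).sum(dim=1)              # (B, dim) shared representation
        return shared

# Usage: fuse however many modality features happen to be available.
if __name__ == "__main__":
    block = NToOneFusion(dim=128)
    feats = [torch.randn(2, 128) for _ in range(3)]  # e.g. 3 available modalities
    print(block(feats).shape)                        # torch.Size([2, 128])
```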
Related papers
- DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition [5.765485747592163]
We propose a Decoupled Representations with Knowledge Fusion (DRKF) method for multimodal emotion recognition.
DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module.
Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED.
arXiv Detail & Related papers (2025-08-03T08:05:57Z)
- Generalized Interpolating Discrete Diffusion [65.74168524007484]
Masked diffusion is a popular choice due to its simplicity and effectiveness.
We derive the theoretical backbone of a family of general interpolating discrete diffusion processes.
Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise.
arXiv Detail & Related papers (2025-03-06T14:30:55Z)
- Spectrum-based Modality Representation Fusion Graph Convolutional Network for Multimodal Recommendation [7.627299398469962]
We propose a new Spectrum-based Modality Representation graph recommender.
It aims to capture both uni-modal and fusion preferences while simultaneously suppressing modality noise.
Experiments on three real-world datasets show the efficacy of our proposed model.
arXiv Detail & Related papers (2024-12-19T15:53:21Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recursive Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
- A Multi-Stage Adaptive Feature Fusion Neural Network for Multimodal Gait Recognition [15.080096318551346]
Most existing gait recognition algorithms are unimodal, and a few multimodal gait recognition algorithms perform multimodal fusion only once.
We propose a multi-stage feature fusion strategy (MSFFS), which performs multimodal fusions at different stages in the feature extraction process.
Also, we propose an adaptive feature fusion module (AFFM) that considers the semantic association between silhouettes and skeletons.
arXiv Detail & Related papers (2023-12-22T03:25:15Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models [54.1843419649895]
We propose a solution based on denoising diffusion probabilistic models (DDPMs).
Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models.
Our method can unite multiple diffusion models trained on multiple sub-tasks and conquer the combined task.
arXiv Detail & Related papers (2022-12-01T18:59:55Z)
- Is Cross-Attention Preferable to Self-Attention for Multi-Modal Emotion Recognition? [36.67937514793215]
Cross-modal attention is seen as an effective mechanism for multi-modal fusion.
We implement and compare a cross-attention and a self-attention model.
We compare the models using different modality combinations for a 7-class emotion classification task.
arXiv Detail & Related papers (2022-02-18T15:44:14Z)
- LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)