Deep Equilibrium Multimodal Fusion
- URL: http://arxiv.org/abs/2306.16645v1
- Date: Thu, 29 Jun 2023 03:02:20 GMT
- Title: Deep Equilibrium Multimodal Fusion
- Authors: Jinhong Ni, Yalong Bai, Wei Zhang, Ting Yao, Tao Mei
- Abstract summary: Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
- Score: 88.04713412107947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal fusion integrates the complementary information present in
multiple modalities and has gained much attention recently. Most existing
fusion approaches either learn a fixed fusion strategy during training and
inference, or are only capable of fusing the information to a certain extent.
Such solutions may fail to fully capture the dynamics of interactions across
modalities, especially when complex intra- and inter-modality correlations
must be considered for informative multimodal fusion. In this paper, we
propose a novel deep equilibrium (DEQ) method for multimodal fusion that
seeks a fixed point of the dynamic multimodal fusion process and models the
feature correlations in an adaptive and recursive manner. This approach
thoroughly encodes the rich information within and across modalities, from low
level to high level, for effective downstream multimodal learning and is
readily pluggable into various multimodal frameworks. Extensive experiments on
BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of
our DEQ fusion. More remarkably, DEQ fusion consistently achieves
state-of-the-art performance on multiple multimodal benchmarks. The code will be released.
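The fixed-point view of fusion described in the abstract can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch illustration of the general deep-equilibrium idea (not the authors' released code): the fused representation z* is defined implicitly by z* = f(z*, x_a, x_b) and is found by iterating a small fusion cell until convergence. The cell architecture, solver, and tolerance are illustrative assumptions.

```python
# Minimal sketch of deep-equilibrium (DEQ) style fusion: the fused state z* is
# the fixed point of a fusion cell f, found here by plain fixed-point iteration.
# The cell design, solver, and stopping rule are illustrative assumptions, not
# the paper's released implementation.
import torch
import torch.nn as nn


class DEQFusion(nn.Module):
    def __init__(self, dim, max_iter=30, tol=1e-4):
        super().__init__()
        # Small fusion cell combining the current fused state with both
        # modality features (hypothetical choice of architecture).
        self.cell = nn.Sequential(
            nn.Linear(3 * dim, dim),
            nn.Tanh(),
        )
        self.max_iter = max_iter
        self.tol = tol

    def forward(self, x_a, x_b):
        z = torch.zeros_like(x_a)                      # start from z_0 = 0
        for _ in range(self.max_iter):                 # fixed-point iteration
            z_next = self.cell(torch.cat([z, x_a, x_b], dim=-1))
            if (z_next - z).norm() < self.tol * (z.norm() + 1e-8):
                return z_next                          # converged to z*
            z = z_next
        return z                                       # best effort if not converged


# Usage: fuse two 64-dim modality embeddings for a batch of 8 samples.
if __name__ == "__main__":
    fuse = DEQFusion(dim=64)
    x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)
    z_star = fuse(x_a, x_b)
    print(z_star.shape)  # torch.Size([8, 64])
```

In a full DEQ model, the backward pass would typically use implicit differentiation through the fixed point rather than backpropagating through the unrolled iterations, which keeps memory cost constant in the number of solver steps.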
Related papers
- Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding.
Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion.
We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an unbiased multiscale modal fusion model for multimodal semantic segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task Learning with Auxiliary Mutual Information Maximization [2.4660652494309936]
Multimodal representation learning poses significant challenges.
Existing methods often struggle to exploit the unique characteristics of each modality.
In this study, we propose Self-MI, a self-supervised learning approach with auxiliary mutual information maximization.
arXiv Detail & Related papers (2023-11-07T08:10:36Z)
- Provable Dynamic Fusion for Low-Quality Multimodal Data [94.39538027450948]
Dynamic multimodal fusion emerges as a promising learning paradigm.
Despite its widespread use, theoretical justifications in this field are still notably lacking.
This paper provides a theoretical understanding of when dynamic fusion helps, analyzed under one of the most popular multimodal fusion frameworks from the generalization perspective.
A novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which improves classification accuracy and model robustness.
arXiv Detail & Related papers (2023-06-03T08:32:35Z)
- IMF: Interactive Multimodal Fusion Model for Link Prediction [13.766345726697404]
We introduce a novel Interactive Multimodal Fusion (IMF) model to integrate knowledge from different modalities.
Our approach has been demonstrated to be effective through empirical evaluations on several real-world datasets.
arXiv Detail & Related papers (2023-03-20T01:20:02Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
The proposed MKGformer obtains SOTA performance on four datasets spanning multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input to account for the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- A novel multimodal fusion network based on a joint coding model for lane line segmentation [22.89466867866239]
We introduce a novel multimodal fusion architecture from an information theory perspective.
We demonstrate its practical utility using LiDAR camera fusion networks.
Our optimal fusion network achieves over 85% lane line accuracy and over 98.7% overall accuracy.
arXiv Detail & Related papers (2021-03-20T06:47:58Z)
- Investigating Vulnerability to Adversarial Examples on Multimodal Data Fusion in Deep Learning [32.125310341415755]
We investigated whether current multimodal fusion models exploit the complementary information across modalities to defend against adversarial attacks.
We verified that a multimodal fusion model optimized for better prediction remains vulnerable to adversarial attacks, even when only one of the sensors is attacked.
arXiv Detail & Related papers (2020-05-22T03:45:06Z)