Deep Equilibrium Multimodal Fusion
- URL: http://arxiv.org/abs/2306.16645v1
- Date: Thu, 29 Jun 2023 03:02:20 GMT
- Title: Deep Equilibrium Multimodal Fusion
- Authors: Jinhong Ni, Yalong Bai, Wei Zhang, Ting Yao, Tao Mei
- Abstract summary: Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
- Score: 88.04713412107947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal fusion integrates the complementary information present in
multiple modalities and has gained much attention recently. Most existing
fusion approaches either learn a fixed fusion strategy during training and
inference, or are only capable of fusing the information to a certain extent.
Such solutions may fail to fully capture the dynamics of interactions across
modalities, especially when complex intra- and inter-modality correlations
must be considered for informative multimodal fusion. In this paper, we
propose a novel deep equilibrium (DEQ) method for multimodal fusion that
seeks a fixed point of the dynamic multimodal fusion process and models the
feature correlations in an adaptive and recursive manner. This approach
thoroughly encodes the rich information within and across modalities, from low
level to high level, for effective downstream multimodal learning and is
readily pluggable into various multimodal frameworks. Extensive experiments on
BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of
our DEQ fusion. More remarkably, DEQ fusion consistently achieves
state-of-the-art performance on multiple multimodal benchmarks. The code will be released.
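The fixed-point view of fusion described in the abstract can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch illustration of the general deep-equilibrium idea (not the authors' released code): the fused representation z* is defined implicitly by z* = f(z*, x_a, x_b) and is found by iterating a small fusion cell until convergence. The cell architecture, solver, and tolerance are illustrative assumptions.

```python
# Minimal sketch of deep-equilibrium (DEQ) style fusion: the fused state z* is
# the fixed point of a fusion cell f, found here by plain fixed-point iteration.
# The cell design, solver, and stopping rule are illustrative assumptions, not
# the paper's released implementation.
import torch
import torch.nn as nn


class DEQFusion(nn.Module):
    def __init__(self, dim, max_iter=30, tol=1e-4):
        super().__init__()
        # Small fusion cell combining the current fused state with both
        # modality features (hypothetical choice of architecture).
        self.cell = nn.Sequential(
            nn.Linear(3 * dim, dim),
            nn.Tanh(),
        )
        self.max_iter = max_iter
        self.tol = tol

    def forward(self, x_a, x_b):
        z = torch.zeros_like(x_a)                      # start from z_0 = 0
        for _ in range(self.max_iter):                 # fixed-point iteration
            z_next = self.cell(torch.cat([z, x_a, x_b], dim=-1))
            if (z_next - z).norm() < self.tol * (z.norm() + 1e-8):
                return z_next                          # converged to z*
            z = z_next
        return z                                       # best effort if not converged


# Usage: fuse two 64-dim modality embeddings for a batch of 8 samples.
if __name__ == "__main__":
    fuse = DEQFusion(dim=64)
    x_a, x_b = torch.randn(8, 64), torch.randn(8, 64)
    z_star = fuse(x_a, x_b)
    print(z_star.shape)  # torch.Size([8, 64])
```

In a full DEQ model, the backward pass would typically use implicit differentiation through the fixed point rather than backpropagating through the unrolled iterations, which keeps memory cost constant in the number of solver steps.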
Related papers
- Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding.
Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion.
We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an unbiased multiscale modal fusion model for multimodal semantic segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task Learning with Auxiliary Mutual Information Maximization [2.4660652494309936]
Multimodal representation learning poses significant challenges.
Existing methods often struggle to exploit the unique characteristics of each modality.
In this study, we propose Self-MI, a self-supervised learning approach with auxiliary mutual information maximization.
arXiv Detail & Related papers (2023-11-07T08:10:36Z)
- Provable Dynamic Fusion for Low-Quality Multimodal Data [94.39538027450948]
Dynamic multimodal fusion emerges as a promising learning paradigm.
Despite its widespread use, theoretical justifications in this field are still notably lacking.
This paper provides a theoretical understanding of when dynamic fusion helps, analyzed under one of the most popular multimodal fusion frameworks from the generalization perspective.
A novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which improves classification accuracy and model robustness.
arXiv Detail & Related papers (2023-06-03T08:32:35Z)
- IMF: Interactive Multimodal Fusion Model for Link Prediction [13.766345726697404]
We introduce a novel Interactive Multimodal Fusion (IMF) model to integrate knowledge from different modalities.
Our approach has been demonstrated to be effective through empirical evaluations on several real-world datasets.
arXiv Detail & Related papers (2023-03-20T01:20:02Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
The proposed MKGformer obtains SOTA performance on four datasets spanning multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input to account for the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- A novel multimodal fusion network based on a joint coding model for lane line segmentation [22.89466867866239]
We introduce a novel multimodal fusion architecture from an information theory perspective.
We demonstrate its practical utility using LiDAR camera fusion networks.
Our optimal fusion network achieves over 85% lane line accuracy and over 98.7% overall accuracy.
arXiv Detail & Related papers (2021-03-20T06:47:58Z)
- Investigating Vulnerability to Adversarial Examples on Multimodal Data Fusion in Deep Learning [32.125310341415755]
We investigated whether current multimodal fusion models exploit the complementary information across modalities to defend against adversarial attacks.
We verified that a multimodal fusion model optimized for better prediction remains vulnerable to adversarial attacks, even when only one of the sensors is attacked.
arXiv Detail & Related papers (2020-05-22T03:45:06Z)