Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task
Learning with Auxiliary Mutual Information Maximization
- URL: http://arxiv.org/abs/2311.03785v1
- Date: Tue, 7 Nov 2023 08:10:36 GMT
- Title: Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task
Learning with Auxiliary Mutual Information Maximization
- Authors: Cam-Van Thi Nguyen, Ngoc-Hoa Thi Nguyen, Duc-Trong Le, Quang-Thuy Ha
- Abstract summary: Multimodal representation learning poses significant challenges.
Existing methods often struggle to exploit the unique characteristics of each modality.
In this study, we propose Self-MI, a self-supervised learning approach.
- Score: 2.4660652494309936
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Multimodal representation learning poses significant challenges in capturing
informative and distinct features from multiple modalities. Existing methods
often struggle to exploit the unique characteristics of each modality due to
unified multimodal annotations. In this study, we propose Self-MI, a
self-supervised approach that leverages Contrastive Predictive Coding (CPC) as
an auxiliary technique to maximize the Mutual Information (MI) between unimodal
input pairs, as well as between the multimodal fusion result and the unimodal
inputs. Moreover, we design a label generation module, $ULG_{MI}$ for short,
that creates meaningful and informative labels for each modality in a
self-supervised manner. By maximizing the Mutual Information, we encourage
better alignment between the multimodal fusion result and the individual
modalities, facilitating improved multimodal fusion. Extensive experiments on
three benchmark datasets, CMU-MOSI, CMU-MOSEI, and SIMS, demonstrate the
effectiveness of Self-MI in enhancing the multimodal fusion task.
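The abstract describes a CPC-based auxiliary objective that maximizes MI between the fusion result and the unimodal inputs, but this page gives no code. Below is a minimal sketch, assuming PyTorch, of how such an InfoNCE-style objective could be attached to a fusion model; the class name, bilinear projection, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed PyTorch) of a CPC-style InfoNCE objective that
# maximizes mutual information between a fused multimodal representation
# and each unimodal representation, in the spirit of Self-MI. Names,
# dimensions, and the projection head are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CPCMutualInfo(nn.Module):
    def __init__(self, fusion_dim: int, unimodal_dim: int):
        super().__init__()
        # Bilinear critic scoring fused/unimodal pairs (a common CPC choice).
        self.projection = nn.Linear(unimodal_dim, fusion_dim, bias=False)

    def forward(self, fused: torch.Tensor, unimodal: torch.Tensor) -> torch.Tensor:
        """InfoNCE lower bound on MI(fused; unimodal).

        fused:    (batch, fusion_dim)   multimodal fusion result
        unimodal: (batch, unimodal_dim) one modality's representation
        """
        # Score every fused vector against every unimodal vector in the batch.
        scores = fused @ self.projection(unimodal).t()          # (batch, batch)
        # Positives lie on the diagonal; other batch items act as negatives.
        targets = torch.arange(fused.size(0), device=fused.device)
        return F.cross_entropy(scores, targets)


# Usage: sum the auxiliary MI losses over modalities and add them (with a
# weighting factor) to the main task objective, e.g. sentiment regression.
if __name__ == "__main__":
    batch, d_fusion, d_text, d_audio = 8, 128, 64, 32
    fused = torch.randn(batch, d_fusion)
    text_repr = torch.randn(batch, d_text)
    audio_repr = torch.randn(batch, d_audio)
    mi_text = CPCMutualInfo(d_fusion, d_text)
    mi_audio = CPCMutualInfo(d_fusion, d_audio)
    aux_loss = mi_text(fused, text_repr) + mi_audio(fused, audio_repr)
    print(aux_loss.item())
```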
Related papers
- What to align in multimodal contrastive learning? [7.7439394183358745]
We introduce CoMM, a Contrastive MultiModal learning strategy that enables communication between modalities in a single multimodal space.
Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy.
CoMM learns complex multimodal interactions and achieves state-of-the-art results on six multimodal benchmarks.
arXiv Detail & Related papers (2024-09-11T16:42:22Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels [11.853566358505434]
MESEN is a multimodal-empowered unimodal sensing framework.
MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality.
MESEN achieves significant performance improvements over state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-02T13:54:05Z) - MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z) - Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process; a minimal sketch of this fixed-point view appears after this list.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z) - Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z) - On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z) - IMF: Interactive Multimodal Fusion Model for Link Prediction [13.766345726697404]
We introduce a novel Interactive Multimodal Fusion (IMF) model to integrate knowledge from different modalities.
Our approach has been demonstrated to be effective through empirical evaluations on several real-world datasets.
arXiv Detail & Related papers (2023-03-20T01:20:02Z)
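As referenced in the Deep Equilibrium Multimodal Fusion entry above, the fixed-point view of fusion can be illustrated with a short toy sketch. This is an assumption-laden illustration in PyTorch: it runs a plain iteration to an approximate fixed point, whereas the actual DEQ method uses a root solver with implicit differentiation; the module names and shapes are not from the paper.

```python
# Toy sketch (assumed PyTorch) of the fixed-point view behind deep
# equilibrium (DEQ) fusion: iterate a fusion map z_{k+1} = f(z_k, x_1..x_M)
# until it stops changing, and take the (approximate) fixed point as the
# fused representation. Plain iteration and shapes are illustrative only.
import torch
import torch.nn as nn


class EquilibriumFusion(nn.Module):
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        # Fusion map mixing the current state z with all modality features.
        self.mix = nn.Linear(dim * (num_modalities + 1), dim)

    def step(self, z, modalities):
        # One application of the fusion map f(z, x_1..x_M).
        return torch.tanh(self.mix(torch.cat([z, *modalities], dim=-1)))

    def forward(self, modalities, max_iters: int = 50, tol: float = 1e-4):
        z = torch.zeros_like(modalities[0])
        for _ in range(max_iters):
            z_next = self.step(z, modalities)
            # Stop once the update is small relative to the current state.
            if (z_next - z).norm() < tol * (z.norm() + 1e-8):
                z = z_next
                break
            z = z_next
        return z  # approximate fixed point = fused representation


if __name__ == "__main__":
    dim = 64
    text, image = torch.randn(1, dim), torch.randn(1, dim)
    fused = EquilibriumFusion(dim, num_modalities=2)([text, image])
    print(fused.shape)
```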