Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal
and Multimodal Representations
- URL: http://arxiv.org/abs/2210.17444v1
- Date: Mon, 31 Oct 2022 16:14:18 GMT
- Authors: Sijie Mai, Ying Zeng, Haifeng Hu
- Abstract summary: We introduce the multimodal information bottleneck (MIB), aiming to learn a powerful and sufficient multimodal representation.
We develop three MIB variants, namely, early-fusion MIB, late-fusion MIB, and complete MIB, to focus on different perspectives of information constraints.
Experimental results suggest that the proposed method reaches state-of-the-art performance on the tasks of multimodal sentiment analysis and multimodal emotion recognition.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning effective joint embedding for cross-modal data has always been a
focus in the field of multimodal machine learning. We argue that during
multimodal fusion, the generated multimodal embedding may be redundant, and the
discriminative unimodal information may be ignored, which often interferes with
accurate prediction and leads to a higher risk of overfitting. Moreover,
unimodal representations also contain noisy information that negatively
influences the learning of cross-modal dynamics. To this end, we introduce the
multimodal information bottleneck (MIB), aiming to learn a powerful and
sufficient multimodal representation that is free of redundancy and to filter
out noisy information in unimodal representations. Specifically, inheriting
from the general information bottleneck (IB), MIB aims to learn the minimal
sufficient representation for a given task by maximizing the mutual information
between the representation and the target and simultaneously constraining the
mutual information between the representation and the input data. Unlike
the general IB, MIB regularizes both the multimodal and unimodal
representations, yielding a comprehensive and flexible framework that is
compatible with any fusion method. We develop three MIB variants, namely,
early-fusion MIB, late-fusion MIB, and complete MIB, to focus on different
perspectives of information constraints. Experimental results suggest that the
proposed method reaches state-of-the-art performance on the tasks of multimodal
sentiment analysis and multimodal emotion recognition across three widely used
datasets. The code is available at
https://github.com/TmacMai/Multimodal-Information-Bottleneck.
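The information bottleneck trade-off described in the abstract (keep information about the target, limit information about the input) can be illustrated with a minimal sketch. It uses the standard variational IB surrogate, which is an assumption here and not confirmed as the authors' implementation: the encoder is assumed to output a diagonal Gaussian over the representation, the compression term is the closed-form KL divergence to a standard normal prior (an upper bound on the mutual information with the input), and a squared error stands in for the prediction term under a fixed-variance Gaussian likelihood. The function names and the weight beta are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    In the variational IB surrogate (assumed here), this term upper-bounds
    the mutual information between the representation and the input.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def squared_error(prediction, target):
    """Prediction term: a stand-in for maximizing the mutual information
    between the representation and the target (regression case)."""
    return (prediction - target) ** 2


def ib_loss(mu, log_var, prediction, target, beta=1e-3):
    """Illustrative IB-style loss: task error plus beta times compression.

    beta is a hypothetical trade-off weight; larger values compress harder.
    """
    return squared_error(prediction, target) + beta * gaussian_kl_to_standard_normal(
        mu, log_var
    )


if __name__ == "__main__":
    # A 2-dimensional representation for a single example (made-up numbers).
    print(ib_loss(mu=[0.5, -0.2], log_var=[0.0, -1.0], prediction=0.8, target=1.0))
```

In this sketch the same loss could be applied to a fused representation, to each unimodal representation, or to both. How the three MIB variants combine these terms is not stated in the abstract and is left open here.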
Related papers
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
(2024-08-06T12:45:56Z)
- MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance [10.580712937465032]
We identify the previously ignored gradient conflict between multimodal and unimodal learning objectives.
We propose the MMPareto algorithm, which ensures a final gradient with a direction common to all learning objectives.
Our method is also expected to facilitate multi-task cases with a clear discrepancy in task difficulty.
arXiv Detail & Related papers (2024-05-28T01:19:13Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task
Learning with Auxiliary Mutual Information Maximization [2.4660652494309936]
Multimodal representation learning poses significant challenges.
Existing methods often struggle to exploit the unique characteristics of each modality.
In this study, we propose Self-MI, trained in a self-supervised learning fashion.
arXiv Detail & Related papers (2023-11-07T08:10:36Z) - Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method for multimodal fusion that seeks a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
arXiv Detail & Related papers (2023-06-29T03:02:20Z) - Factorized Contrastive Learning: Going Beyond Multi-view Redundancy [116.25342513407173]
This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy.
On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-06-08T15:17:04Z) - Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
(2023-06-07T15:44:53Z)
- Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments [18.14974353615421]
We propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique.
In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality.
We attain state-of-the-art performance on two challenging benchmarks: multimodal 3D hand-pose estimation and multimodal surgical video segmentation.
(2022-11-07T14:27:38Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
(2021-12-13T08:58:03Z)