Multimodal Understanding Through Correlation Maximization and Minimization
- URL: http://arxiv.org/abs/2305.03125v1
- Date: Thu, 4 May 2023 19:53:05 GMT
- Title: Multimodal Understanding Through Correlation Maximization and Minimization
- Authors: Yifeng Shi, Marc Niethammer
- Abstract summary: We study the intrinsic nature of multimodal data by asking the following questions:
Can we learn more structured latent representations of general multimodal data?
Can we intuitively understand, both mathematically and visually, what the latent representations capture?
- Score: 23.8764755753415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning has mainly focused on learning large models on, and
fusing feature representations from, different modalities for better
performance on downstream tasks. In this work, we take a detour from this
trend and study the intrinsic nature of multimodal data by asking the following
questions: 1) can we learn more structured latent representations of general
multimodal data? and 2) can we intuitively understand, both mathematically and
visually, what those latent representations capture? To answer 1), we propose a
general and lightweight framework, Multimodal Understanding Through Correlation
Maximization and Minimization (MUCMM), that can be incorporated into any large
pre-trained network. MUCMM learns both common and individual representations:
the common representations capture what is shared between the modalities, while
the individual representations capture the aspects unique to each modality. To
answer 2), we propose novel scores that summarize the learned common and
individual structures and visualize the score gradients with respect to the
input, visually discerning what the different representations capture. We
further provide mathematical intuition for the computed gradients in a linear
setting, and demonstrate the effectiveness of our approach through a variety of
experiments.
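As a rough illustration of the correlation maximization and minimization idea in the linear setting the abstract alludes to, the sketch below trains linear "common" and "individual" heads on two toy modalities and then differentiates an illustrative common score with respect to the input. The head names, loss weights, and score definition are assumptions for illustration, not the authors' MUCMM implementation.

```python
# Hypothetical sketch: linear "common" and "individual" heads trained by
# maximizing cross-modal correlation for the common parts and minimizing it
# for the individual parts. Names and weights are assumptions, not MUCMM itself.
import torch

torch.manual_seed(0)

def corr(a, b, eps=1e-8):
    # Mean absolute per-dimension correlation between two batches.
    a = (a - a.mean(0)) / (a.std(0) + eps)
    b = (b - b.mean(0)) / (b.std(0) + eps)
    return (a * b).mean(0).abs().mean()

n, d_x, d_y, d_lat = 256, 16, 12, 4
x = torch.randn(n, d_x)
y = x[:, :d_y] + 0.5 * torch.randn(n, d_y)   # toy pair of correlated modalities

heads = torch.nn.ModuleDict({
    "cx": torch.nn.Linear(d_x, d_lat), "cy": torch.nn.Linear(d_y, d_lat),  # common
    "ix": torch.nn.Linear(d_x, d_lat), "iy": torch.nn.Linear(d_y, d_lat),  # individual
})
opt = torch.optim.Adam(heads.parameters(), lr=1e-2)

for _ in range(500):
    cx, cy = heads["cx"](x), heads["cy"](y)
    ix, iy = heads["ix"](x), heads["iy"](y)
    # Maximize correlation between the common representations; minimize it
    # between the individual representations and against the common ones.
    loss = -corr(cx, cy) + corr(ix, iy) + corr(ix, cx) + corr(iy, cy)
    opt.zero_grad()
    loss.backward()
    opt.step()

# An illustrative "common score" for one example, and its gradient with respect
# to the input, showing which input coordinates drive the shared structure.
x0 = x[:1].clone().requires_grad_(True)
y0 = y[:1].clone().requires_grad_(True)
score = (heads["cx"](x0) * heads["cy"](y0)).sum()
score.backward()
print("d(score)/dx:", x0.grad.squeeze())
```

In the paper the heads would sit on top of a large pre-trained network and the scores are the ones the authors define; the toy linear setup above only mirrors the linear-setting intuition mentioned in the abstract.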
Related papers
- On the Comparison between Multi-modal and Single-modal Contrastive Learning [50.74988548106031]
We introduce a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning.
We identify the signal-to-noise ratio (SNR) as the critical factor that impacts the downstream-task generalizability of both multi-modal and single-modal contrastive learning.
Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning.
arXiv Detail & Related papers (2024-11-05T06:21:17Z)
- Multiple Heads are Better than One: Mixture of Modality Knowledge Experts for Entity Representation Learning [51.80447197290866]
Learning high-quality multi-modal entity representations is an important goal of multi-modal knowledge graph (MMKG) representation learning.
Existing methods focus on crafting elegant entity-wise multi-modal fusion strategies.
We introduce a novel framework with Mixture of Modality Knowledge experts (MoMoK) to learn adaptive multi-modal entity representations.
arXiv Detail & Related papers (2024-05-27T06:36:17Z)
- Constrained Multiview Representation for Self-supervised Contrastive Learning [4.817827522417457]
We introduce a novel approach predicated on representation distance-based mutual information (MI) for measuring the significance of different views.
We harness multi-view representations extracted from the frequency domain, re-evaluating their significance based on mutual information.
arXiv Detail & Related papers (2024-02-05T19:09:33Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Multimodal Graph Learning for Generative Tasks [89.44810441463652]
Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize.
We propose Multimodal Graph Learning (MMGL), a framework for capturing information from multiple multimodal neighbors with relational structures among them.
arXiv Detail & Related papers (2023-10-11T13:25:03Z)
- Decoupling Common and Unique Representations for Multimodal Self-supervised Learning [22.12729786091061]
We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning.
By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities.
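As a rough sketch of what "multimodal redundancy reduction" for decoupling common and unique representations could look like, the snippet below computes a Barlow-Twins-style cross-correlation matrix and treats the first block of embedding dimensions as shared and the rest as unique; the dimension split and the loss weights are assumptions for illustration, not DeCUR's released code.

```python
# Rough sketch of a redundancy-reduction loss that decouples common and unique
# embedding dimensions across two modalities (assumptions, not DeCUR's code).
import torch

def cross_corr(a, b, eps=1e-8):
    # Normalized cross-correlation matrix between two embedding batches.
    a = (a - a.mean(0)) / (a.std(0) + eps)
    b = (b - b.mean(0)) / (b.std(0) + eps)
    return a.T @ b / a.shape[0]

def decoupling_loss(z1, z2, d_common):
    """z1, z2: (batch, dim) embeddings from two modalities.
    The first d_common dimensions are treated as shared, the rest as unique."""
    c = cross_corr(z1, z2)
    common = c[:d_common, :d_common]
    unique = c[d_common:, d_common:]
    # Shared block: push the diagonal toward 1 (align the modalities) and the
    # off-diagonal toward 0 (reduce redundancy between shared dimensions).
    on_diag = (torch.diagonal(common) - 1).pow(2).sum()
    off_diag = (common - torch.diag(torch.diagonal(common))).pow(2).sum()
    # Unique block: push the whole block toward 0, so these dimensions carry
    # modality-specific information only.
    unique_term = unique.pow(2).sum()
    return on_diag + 5e-3 * off_diag + 5e-3 * unique_term

z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
print(decoupling_loss(z1, z2, d_common=48))
```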
arXiv Detail & Related papers (2023-09-11T08:35:23Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Identifiability Results for Multimodal Contrastive Learning [72.15237484019174]
We show that it is possible to recover shared factors in a more general setup than the multi-view setting studied previously.
Our work provides a theoretical basis for multimodal representation learning and explains in which settings multimodal contrastive learning can be effective in practice.
arXiv Detail & Related papers (2023-03-16T09:14:26Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
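The instance-based task presumably takes the familiar form of an InfoNCE-style contrastive loss between paired modality embeddings; the sketch below shows that generic form under this assumption and is not a reproduction of MMCL's losses.

```python
# Generic InfoNCE-style instance contrastive loss between paired modality
# embeddings (an assumption about the "instance-based" task, not MMCL's code).
import torch
import torch.nn.functional as F

def instance_contrastive_loss(za, zb, temperature=0.07):
    """za, zb: (batch, dim) embeddings; row i of za is paired with row i of zb."""
    za = F.normalize(za, dim=1)
    zb = F.normalize(zb, dim=1)
    logits = za @ zb.T / temperature        # pairwise cross-modal similarities
    targets = torch.arange(za.shape[0])     # positives sit on the diagonal
    # Symmetric cross-entropy: modality A -> B and B -> A.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

za, zb = torch.randn(32, 128), torch.randn(32, 128)
print(instance_contrastive_loss(za, zb))
```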
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- How to Sense the World: Leveraging Hierarchy in Multimodal Perception for Robust Reinforcement Learning Agents [9.840104333194663]
We argue for hierarchy in the design of representation models and contribute a novel multimodal representation model, MUSE.
MUSE serves as the sensory representation model for deep reinforcement learning agents that receive multimodal observations in Atari games.
We perform a comparative study over different designs of reinforcement learning agents, showing that MUSE allows agents to perform tasks under incomplete perceptual experience with minimal performance loss.
arXiv Detail & Related papers (2021-10-07T16:35:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.