Understanding and Constructing Latent Modality Structures in Multi-modal
Representation Learning
- URL: http://arxiv.org/abs/2303.05952v1
- Date: Fri, 10 Mar 2023 14:38:49 GMT
- Title: Understanding and Constructing Latent Modality Structures in Multi-modal
Representation Learning
- Authors: Qian Jiang, Changyou Chen, Han Zhao, Liqun Chen, Qing Ping, Son Dinh
Tran, Yi Xu, Belinda Zeng, Trishul Chilimbi
- Abstract summary: We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
- Score: 53.68371566336254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive loss has been increasingly used in learning representations from
multiple modalities. In the limit, the nature of the contrastive loss
encourages modalities to exactly match each other in the latent space. Yet it
remains an open question how the modality alignment affects the downstream task
performance. In this paper, based on an information-theoretic argument, we
first prove that exact modality alignment is sub-optimal in general for
downstream prediction tasks. Hence we advocate that the key to better
performance lies in meaningful latent modality structures instead of perfect
modality alignment. To this end, we propose three general approaches to
construct latent modality structures. Specifically, we design 1) a deep feature
separation loss for intra-modality regularization; 2) a Brownian-bridge loss
for inter-modality regularization; and 3) a geometric consistency loss for both
intra- and inter-modality regularization. Extensive experiments are conducted
on two popular multi-modal representation learning frameworks: the CLIP-based
two-tower model and the ALBEF-based fusion model. We test our model on a
variety of tasks including zero/few-shot image classification, image-text
retrieval, visual question answering, visual reasoning, and visual entailment.
Our method achieves consistent improvements over existing methods,
demonstrating the effectiveness and generalizability of our proposed approach to latent modality structure regularization.
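The three regularizers are only named in the abstract, not specified. As a rough illustration, below is a minimal PyTorch sketch pairing the standard CLIP-style symmetric InfoNCE contrastive loss with one plausible reading of the geometric consistency idea (encouraging the pairwise-similarity geometry within the image space to match that within the text space). The function names, the exact form of the regularizer, and the combination weight are assumptions, not the authors' implementation; the deep feature separation and Brownian-bridge losses are not sketched here.

```python
# Minimal sketch (PyTorch); illustrative, NOT the authors' implementation.
import torch
import torch.nn.functional as F

def clip_infonce(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP-style symmetric contrastive loss over a batch of pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def geometric_consistency(img_emb, txt_emb):
    """Hypothetical regularizer: intra-modality similarity structure
    (image-image vs. text-text) should agree across the two modalities."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return F.mse_loss(img @ img.t(), txt @ txt.t())

# Combined objective with an assumed trade-off weight `lam`:
# loss = clip_infonce(v, t) + lam * geometric_consistency(v, t)
```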
Related papers
- Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner (a minimal illustrative sketch appears after this list).
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Efficient Alternating Minimization Solvers for Wyner Multi-View Unsupervised Learning [0.0]
We propose two novel formulations that enable the development of computationally efficient solvers based on the alternating minimization principle (a generic sketch of the alternating scheme appears after this list).
The proposed solvers offer computational efficiency, theoretical convergence guarantees, a characterization of local-minima complexity as a function of the number of views, and exceptional accuracy compared with state-of-the-art techniques.
arXiv Detail & Related papers (2023-03-28T10:17:51Z)
- Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning [79.83792914684985]
We prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations.
Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem.
arXiv Detail & Related papers (2022-11-26T21:02:09Z)
- Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis [47.29528724322795]
Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently.
Despite significant progress, there are still two major challenges on the way towards robust MSA.
We propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR).
arXiv Detail & Related papers (2022-08-16T08:02:30Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z)
- Efficient Iterative Amortized Inference for Learning Symmetric and Disentangled Multi-Object Representations [8.163697683448811]
We introduce EfficientMORL, an efficient framework for the unsupervised learning of object-centric representations.
We show that optimization challenges caused by requiring both symmetry and disentanglement can be addressed by high-cost iterative amortized inference.
We demonstrate strong object decomposition and disentanglement on the standard multi-object benchmark while achieving nearly an order of magnitude faster training and test time inference.
arXiv Detail & Related papers (2021-06-07T14:02:49Z)
- COBRA: Contrastive Bi-Modal Representation Algorithm [43.33840912256077]
We present a novel framework that aims to train two modalities in a joint fashion inspired by Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms.
We empirically show that this framework reduces the modality gap significantly and generates a robust and task-agnostic joint-embedding space.
We outperform existing work on four diverse downstream tasks spanning seven benchmark cross-modal datasets.
arXiv Detail & Related papers (2020-05-07T18:20:12Z)
- Unpaired Multi-modal Segmentation via Knowledge Distillation [77.39798870702174]
We propose a novel learning scheme for unpaired cross-modality image segmentation.
In our method, we heavily reuse network parameters by sharing all convolutional kernels across CT and MRI (see the sketch after this list).
We have extensively validated our approach on two multi-class segmentation problems.
arXiv Detail & Related papers (2020-01-06T20:03:17Z)
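For the UmURL entry above, here is a minimal sketch of what an early-fusion, single-stream encoder can look like: each modality's tokens are projected into a common space, concatenated, and encoded by one shared Transformer. The module names and dimensions are illustrative assumptions, not UmURL's actual architecture.

```python
# Early-fusion, single-stream encoder sketch (PyTorch); illustrative only.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, dim_a, dim_b, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        # Per-modality projections into a shared token space.
        self.proj_a = nn.Linear(dim_a, d_model)
        self.proj_b = nn.Linear(dim_b, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # A single shared encoder processes the fused token sequence.
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens_a, tokens_b):
        # tokens_a: (B, La, dim_a), tokens_b: (B, Lb, dim_b)
        fused = torch.cat([self.proj_a(tokens_a), self.proj_b(tokens_b)], dim=1)
        return self.encoder(fused)  # (B, La + Lb, d_model)
```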
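For the Wyner multi-view entry, the sketch below shows the generic alternating (block-coordinate) minimization principle the solvers are based on: each block of variables is updated while the other is held fixed. It illustrates only the general scheme; the actual Wyner solvers use problem-specific updates not reproduced here.

```python
# Generic alternating-minimization loop (PyTorch); illustrative only.
import torch

def alternating_minimize(loss_fn, x0, y0, steps=100, lr=0.05):
    """Minimize loss_fn(x, y) by alternating updates of the two blocks."""
    x = x0.detach().clone().requires_grad_(True)
    y = y0.detach().clone().requires_grad_(True)
    opt_x = torch.optim.SGD([x], lr=lr)
    opt_y = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt_x.zero_grad()
        loss_fn(x, y.detach()).backward()  # update x with y frozen
        opt_x.step()
        opt_y.zero_grad()
        loss_fn(x.detach(), y).backward()  # update y with x frozen
        opt_y.step()
    return x.detach(), y.detach()
```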
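For the unpaired cross-modality segmentation entry, a common way to share all convolutional kernels across CT and MRI is to share the convolution weights while keeping normalization modality-specific, since the two modalities have very different intensity statistics. The per-modality normalization detail is an assumption about the general pattern, not a claim about that paper's exact design.

```python
# Shared-kernel block with per-modality normalization (PyTorch); a sketch.
import torch.nn as nn

class SharedConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, n_modalities=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)  # weights shared
        # Separate BatchNorm per modality (e.g., 0 = CT, 1 = MRI).
        self.norms = nn.ModuleList(
            nn.BatchNorm2d(out_ch) for _ in range(n_modalities))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, modality: int):
        return self.act(self.norms[modality](self.conv(x)))
```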