Generative Modeling of Class Probability for Multi-Modal Representation Learning
- URL: http://arxiv.org/abs/2503.17417v2
- Date: Mon, 14 Apr 2025 06:45:58 GMT
- Title: Generative Modeling of Class Probability for Multi-Modal Representation Learning
- Authors: Jungkyoo Shin, Bumsoo Kim, Eunwoo Kim
- Abstract summary: Multi-modal understanding plays a crucial role in artificial intelligence by enabling models to jointly interpret inputs from different modalities. We propose a novel class anchor alignment approach that leverages class probability distributions for multi-modal representation learning. Our method, Class-anchor-ALigned generative Modeling (CALM), encodes class anchors as prompts to generate and align class probability distributions for each modality.
- Score: 7.5696616045063845
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal understanding plays a crucial role in artificial intelligence by enabling models to jointly interpret inputs from different modalities. However, conventional approaches such as contrastive learning often struggle with modality discrepancies, leading to potential misalignments. In this paper, we propose a novel class anchor alignment approach that leverages class probability distributions for multi-modal representation learning. Our method, Class-anchor-ALigned generative Modeling (CALM), encodes class anchors as prompts to generate and align class probability distributions for each modality, enabling more effective alignment. Furthermore, we introduce a cross-modal probabilistic variational autoencoder to model uncertainty in the alignment, enhancing the ability to capture deeper relationships between modalities and data variations. Extensive experiments on four benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations. This highlights its superior generalization capabilities in multi-modal representation learning.
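The paper's exact formulation is not reproduced here, but the core idea of aligning modalities through class probability distributions can be sketched roughly as follows: embed each modality, compute a softmax distribution over similarities to a set of class anchors, and penalize disagreement between the modalities' distributions. All names (`class_prob_dist`, `symmetric_kl`, the temperature value) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def class_prob_dist(features, anchors, temperature=0.07):
    """Softmax over cosine similarities to class anchors -> class probability distribution."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=-1, keepdims=True)
    logits = f @ a.T / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two class distributions (an alignment loss)."""
    p, q = p + eps, q + eps
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return 0.5 * (kl_pq + kl_qp)

# toy example: image and text embeddings for one sample, 3 class anchors
rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 8))
img_feat = rng.normal(size=(1, 8))
txt_feat = img_feat + 0.1 * rng.normal(size=(1, 8))  # nearly aligned modalities

p_img = class_prob_dist(img_feat, anchors)
p_txt = class_prob_dist(txt_feat, anchors)
loss = symmetric_kl(p_img, p_txt)  # small when modalities agree on class probabilities
```

Comparing distributions over shared class anchors, rather than raw embeddings, sidesteps some of the modality discrepancy that direct contrastive alignment struggles with.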
Related papers
- Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion [6.621745547882088]
The existence of modality imbalance hinders multimodal learning from achieving its expected superiority over unimodal models in practice. By designing a sustained boosting algorithm, we propose a novel multimodal learning approach to balance the classification ability of weak and strong modalities.
arXiv Detail & Related papers (2025-02-27T14:12:20Z)
- Asymmetric Reinforcing against Multi-modal Representation Bias [59.685072206359855]
We propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. Our method significantly improves multimodal learning performance and makes notable progress in mitigating imbalanced multimodal learning.
arXiv Detail & Related papers (2025-01-02T13:00:06Z)
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
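The discriminative side of this pairing is typically a CLIP-style symmetric contrastive loss over an image-text batch; a minimal sketch of that loss (the function name and temperature are illustrative, not from the paper):

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching image-text pairs sit on the
    diagonal of the similarity matrix and are pushed above all mismatches."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # matching pairs on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image-to-text and text-to-image cross-entropies
    return 0.5 * (xent(logits) + xent(logits.T))
```

Generative training instead maximizes token likelihoods; a unified approach must balance both objectives on shared representations.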
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs)
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
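The simplest form of such a module is a per-modality projection into one shared, normalized space, after which any modality pairing (including pairings unseen in training) can be compared directly. A hedged sketch with made-up dimensions and names:

```python
import numpy as np

rng = np.random.default_rng(0)
D_SHARED = 16

# hypothetical per-modality projection matrices into one shared space
W_audio = rng.normal(scale=0.1, size=(32, D_SHARED))   # audio features are 32-d
W_video = rng.normal(scale=0.1, size=(64, D_SHARED))   # video features are 64-d

def project(x, W):
    """Linear map into the shared space, L2-normalized so modalities are comparable."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

audio = rng.normal(size=(4, 32))
video = rng.normal(size=(4, 64))
z_a, z_v = project(audio, W_audio), project(video, W_video)

# both modalities now live in the same 16-d space; an unseen pairing
# (e.g. audio-video at test time) can be scored by cosine similarity
sims = (z_a * z_v).sum(axis=-1)
```

The paper's actual module is richer than a linear map, but the interface, heterogeneous inputs mapped to one information-preserving common space, is the same.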
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
arXiv Detail & Related papers (2023-06-07T14:16:44Z)
- Kernel Density Matrices for Probabilistic Deep Learning [8.486487001779416]
In quantum mechanics, a density matrix is the most general way to describe the state of a quantum system.
This paper introduces a novel approach to probabilistic deep learning, kernel density matrices.
It provides a simpler yet effective mechanism for representing joint probability distributions of both continuous and discrete random variables.
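To make the density-matrix idea concrete, here is the textbook discrete case, not the paper's kernel construction: a density matrix is a convex mixture of outer products of unit state vectors, and the Born rule reads outcome probabilities off its diagonal.

```python
import numpy as np

# mixture of two pure states: |0> and the equal superposition (|0>+|1>)/sqrt(2)
states = np.array([[1.0, 0.0],
                   [1.0 / np.sqrt(2), 1.0 / np.sqrt(2)]])
weights = np.array([0.5, 0.5])

# density matrix: rho = sum_i w_i |s_i><s_i|
rho = sum(w * np.outer(s, s) for w, s in zip(weights, states))

# Born rule: probability of basis outcome k is e_k^T rho e_k = rho[k, k]
probs = np.diag(rho)

# properties that make rho a valid state: unit trace, positive semidefinite
assert np.isclose(np.trace(rho), 1.0)
assert np.all(np.linalg.eigvalsh(rho) >= -1e-12)
```

The kernel density matrices of the paper generalize this by building the state vectors from kernel feature maps, which is what lets one object represent joint distributions over continuous and discrete variables.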
arXiv Detail & Related papers (2023-05-26T12:59:58Z)
- Multimodal Adversarially Learned Inference with Factorized Discriminators [10.818838437018682]
We propose a novel approach to generative modeling of multimodal data based on generative adversarial networks.
To learn a coherent multimodal generative model, we show that it is necessary to align different encoder distributions with the joint decoder distribution simultaneously.
By taking advantage of contrastive learning through factorizing the discriminator, we train our model on unimodal data.
arXiv Detail & Related papers (2021-12-20T08:18:49Z)
- Discriminative Multimodal Learning via Conditional Priors in Generative Models [21.166519800652047]
This research studies the realistic scenario in which all modalities and class labels are available for model training.
We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities.
arXiv Detail & Related papers (2021-10-09T17:22:24Z)
- Trusted Multi-View Classification [76.73585034192894]
We propose a novel multi-view classification method, termed trusted multi-view classification.
It provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level.
The proposed algorithm jointly utilizes multiple views to promote both classification reliability and robustness.
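Evidence-level integration of views is commonly realized with a reduced Dempster's combination rule over subjective opinions, i.e. per-class belief masses `b` plus an uncertainty mass `u`; the sketch below assumes that formulation and is not lifted from the paper's code.

```python
import numpy as np

def combine_opinions(b1, u1, b2, u2):
    """Fuse two views' subjective opinions (belief masses b summing with u to 1)
    via a reduced Dempster's rule: agreement reinforces belief, conflicting
    mass is renormalized away, and the joint uncertainty shrinks."""
    # conflict: mass assigned to different classes by the two views (i != j)
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)
    scale = 1.0 / (1.0 - conflict)
    b = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    u = scale * (u1 * u2)
    return b, u

# two views that both lean toward class 0, each with moderate uncertainty
b, u = combine_opinions(np.array([0.6, 0.1]), 0.3,
                        np.array([0.5, 0.2]), 0.3)
# the fused opinion is more confident in class 0 and less uncertain overall
```

Because fusion happens on calibrated evidence rather than raw logits, an unreliable view contributes mostly uncertainty instead of corrupting the prediction, which is the source of the claimed robustness.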
arXiv Detail & Related papers (2021-02-03T13:30:26Z)
- Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)
- MHVAE: a Human-Inspired Deep Hierarchical Generative Model for Multimodal Representation Learning [8.70928211339504]
We contribute the Multimodal Hierarchical Variational Auto-encoder (MHVAE), a hierarchical multimodal generative model for representation learning.
Inspired by human cognitive models, the MHVAE is able to learn modality-specific distributions and a joint-modality distribution, responsible for cross-modality inference.
Our model performs on par with other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.
arXiv Detail & Related papers (2020-06-04T16:24:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.