Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds
- URL: http://arxiv.org/abs/2309.00380v2
- Date: Fri, 19 Apr 2024 03:24:07 GMT
- Title: Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds
- Authors: Marcel Hirt, Domenico Campolo, Victoria Leong, Juan-Pablo Ortega
- Abstract summary: Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research.
In this work, we consider a variational bound that can tightly approximate the data log-likelihood.
We develop more flexible aggregation schemes that generalize PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks.
- Score: 5.549794481031468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research. Multi-modal Variational Autoencoders (VAEs) have been a popular generative model class that learns latent representations that jointly explain multiple modalities. Various objective functions for such models have been suggested, often motivated as lower bounds on the multi-modal data log-likelihood or from information-theoretic considerations. To encode latent variables from different modality subsets, Product-of-Experts (PoE) or Mixture-of-Experts (MoE) aggregation schemes have been routinely used and shown to yield different trade-offs, for instance, regarding their generative quality or consistency across multiple modalities. In this work, we consider a variational bound that can tightly approximate the data log-likelihood. We develop more flexible aggregation schemes that generalize PoE or MoE approaches by combining encoded features from different modalities based on permutation-invariant neural networks. Our numerical experiments illustrate trade-offs for multi-modal variational bounds and various aggregation schemes. We show that tighter variational bounds and more flexible aggregation models can become beneficial when one wants to approximate the true joint distribution over observed modalities and latent variables in identifiable models.
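To make the aggregation schemes named in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes diagonal Gaussian experts, uniform mixture weights, no prior expert in the product, and a simple sum-pooling (Deep Sets style) network as one example of a permutation-invariant aggregator. The function names and the weight matrices `W_phi` and `W_rho` are hypothetical and introduced only for illustration.

```python
import numpy as np


def poe_gaussian(mus, logvars):
    """Product of diagonal Gaussian experts (assumed form, no prior expert).

    Precisions add up; the mean is the precision-weighted average.
    mus, logvars: arrays of shape (M, D) for M modalities.
    """
    mus = np.asarray(mus, dtype=float)
    precisions = np.exp(-np.asarray(logvars, dtype=float))
    total_precision = precisions.sum(axis=0)
    mu = (precisions * mus).sum(axis=0) / total_precision
    var = 1.0 / total_precision
    return mu, var


def moe_sample(mus, logvars, rng):
    """Mixture of experts with uniform weights (an assumption):
    pick one expert at random, then sample from it."""
    mus = np.asarray(mus, dtype=float)
    stds = np.exp(0.5 * np.asarray(logvars, dtype=float))
    k = rng.integers(mus.shape[0])
    return mus[k] + stds[k] * rng.standard_normal(mus.shape[1])


def sum_pool_aggregate(features, W_phi, W_rho):
    """Sum-pooling aggregator rho(sum_m phi(h_m)).

    Summing over the modality axis makes the output independent of
    the order in which modalities are presented.
    features: (M, F) encoded features, one row per modality.
    W_phi: (F, H) and W_rho: (H, 2 * D) are hypothetical weights.
    Returns the mean and log-variance of a Gaussian over the latent.
    """
    features = np.asarray(features, dtype=float)
    embedded = np.tanh(features @ W_phi)   # phi applied per modality
    pooled = embedded.sum(axis=0)          # permutation-invariant pooling
    out = pooled @ W_rho                   # rho maps to distribution parameters
    mu, logvar = np.split(out, 2)
    return mu, logvar
```

In this sketch, PoE sharpens the posterior as modalities are added (variance shrinks), MoE keeps each expert's spread, and the sum-pooling aggregator learns how to combine features instead of fixing the rule in advance, which is the sense in which such schemes can generalize PoE or MoE.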
Related papers
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Model Composition for Multimodal Large Language Models [73.70317850267149]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
- Mitigating Biases with Diverse Ensembles and Diffusion Models [99.6100669122048]
We propose an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs).
We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features.
We show that DPM-guided diversification is sufficient to remove dependence on primary shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z)
- Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z)
- Improving Multimodal Joint Variational Autoencoders through Normalizing Flows and Correlation Analysis [0.0]
The unimodal posteriors are conditioned on the Deep Canonical Correlation Analysis embeddings.
We also use Normalizing Flows to enrich the unimodal posteriors and achieve more diverse data generation.
arXiv Detail & Related papers (2023-05-19T17:15:34Z)
- Generalizing Multimodal Variational Methods to Sets [35.69942798534849]
This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space.
By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization.
arXiv Detail & Related papers (2022-12-19T23:50:19Z)
- Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [55.28436972267793]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)
- Variational Dynamic Mixtures [18.730501689781214]
We develop variational dynamic mixtures (VDM) to infer sequential latent variables.
In an empirical study, we show that VDM outperforms competing approaches on highly multi-modal datasets.
arXiv Detail & Related papers (2020-10-20T16:10:07Z)
- Learning more expressive joint distributions in multimodal variational methods [0.17188280334580194]
We introduce a method that improves the representational capacity of multimodal variational methods using normalizing flows.
We demonstrate that the model improves on state-of-the-art multimodal methods based on variational inference on various computer vision tasks.
We also show that learning more powerful approximate joint distributions improves the quality of the generated samples.
arXiv Detail & Related papers (2020-09-08T11:45:27Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.