Improving Multimodal Joint Variational Autoencoders through Normalizing Flows and Correlation Analysis
- URL: http://arxiv.org/abs/2305.11832v1
- Date: Fri, 19 May 2023 17:15:34 GMT
- Title: Improving Multimodal Joint Variational Autoencoders through Normalizing Flows and Correlation Analysis
- Authors: Agathe Senellart, Clément Chadebec, Stéphanie Allassonnière
- Abstract summary: The unimodal posteriors are conditioned on the Deep Canonical Correlation Analysis embeddings.
We also use Normalizing Flows to enrich the unimodal posteriors and achieve more diverse data generation.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new multimodal variational autoencoder that can
generate from the joint distribution and conditionally on any number of
complex modalities. The unimodal posteriors are conditioned on Deep Canonical
Correlation Analysis embeddings, which preserve the shared information across
modalities and lead to more coherent cross-modal generations. Furthermore, we
use Normalizing Flows to enrich the unimodal posteriors and achieve more
diverse data generation. Finally, we propose a Product of Experts for
inferring one modality from several others, which makes the model scalable to
any number of modalities. We demonstrate that our method improves likelihood
estimates, the diversity of generations, and in particular coherence metrics
for conditional generation on several datasets.
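To make the Product-of-Experts step concrete, here is a minimal sketch of how
Gaussian unimodal posteriors can be fused by multiplying their densities; the
function name, tensor shapes, and the optional standard-normal prior expert
are illustrative assumptions, not the authors' implementation.

```python
# Minimal Product-of-Experts (PoE) fusion of Gaussian experts N(mu_i, var_i).
# The product of Gaussian densities is Gaussian, with precision equal to the
# sum of the experts' precisions; any subset of modalities can be combined.
import torch

def product_of_experts(mus, logvars, include_prior=True):
    # mus, logvars: (num_experts, batch, latent_dim)
    if include_prior:
        # A standard-normal "prior expert" keeps the product well defined
        # even when a single modality is observed (an assumption here).
        mus = torch.cat([torch.zeros_like(mus[:1]), mus], dim=0)
        logvars = torch.cat([torch.zeros_like(logvars[:1]), logvars], dim=0)
    precisions = torch.exp(-logvars)               # 1 / var_i
    joint_var = 1.0 / precisions.sum(dim=0)        # combined variance
    joint_mu = joint_var * (mus * precisions).sum(dim=0)
    return joint_mu, torch.log(joint_var)
```

Because the fusion is a simple sum over experts, observing one more modality
only adds one more term, which is what makes this style of inference scalable
in the number of modalities.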
Related papers
- Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks [92.32670915472099]
We propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs).
We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection.
arXiv Detail & Related papers (2023-10-03T17:37:52Z)
- Learning multi-modal generative models with permutation-invariant encoders and tighter variational bounds [5.549794481031468]
Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research.
In this work, we consider a variational bound that can tightly approximate the data log-likelihood.
We develop more flexible aggregation schemes that generalize Product-of-Experts (PoE) or Mixture-of-Experts (MoE) approaches by combining encoded features from different modalities with permutation-invariant neural networks.
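As a rough illustration of permutation-invariant aggregation in the DeepSets
spirit, the sketch below encodes each modality's features, pools them with a
sum (so the output is invariant to modality order), and maps the pooled
feature onward; the class name, shapes, and layer sizes are illustrative
assumptions, not the paper's architecture.

```python
# DeepSets-style permutation-invariant aggregation of modality features:
# rho(sum_i phi(x_i)) is unchanged under any reordering of the inputs.
import torch
import torch.nn as nn

class PermutationInvariantAggregator(nn.Module):
    def __init__(self, feat_dim, hidden_dim, out_dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.rho = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, features):
        # features: (num_modalities, batch, feat_dim); summing over the
        # modality axis removes any dependence on input order.
        return self.rho(self.phi(features).sum(dim=0))
```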
arXiv Detail & Related papers (2023-09-01T10:32:21Z)
- Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
arXiv Detail & Related papers (2023-06-07T14:16:44Z)
- Score-Based Multimodal Autoencoders [4.594159253008448]
Multimodal Variational Autoencoders (VAEs) facilitate the construction of a tractable posterior within the latent space, given multiple modalities.
In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of unimodal VAEs.
Our model combines the superior generative quality of unimodal VAEs with coherent integration across different modalities.
arXiv Detail & Related papers (2023-05-25T04:43:47Z)
- Generalizing Multimodal Variational Methods to Sets [35.69942798534849]
This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space.
By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization.
arXiv Detail & Related papers (2022-12-19T23:50:19Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
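For intuition only, the sketch below shows the bottleneck idea in one fusion
layer: each modality attends over its own tokens plus a small set of shared
bottleneck tokens, and the per-modality bottleneck updates are averaged; all
names, shapes, and the two-modality setup are assumptions, not the paper's
exact architecture.

```python
# One fusion layer where cross-modal information must pass through a few
# shared "bottleneck" tokens rather than full pairwise cross-attention.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, video, bottleneck):
        # Each stream self-attends over its own tokens plus the bottleneck.
        a_in = torch.cat([audio, bottleneck], dim=1)
        v_in = torch.cat([video, bottleneck], dim=1)
        a_out, _ = self.attn_a(a_in, a_in, a_in)
        v_out, _ = self.attn_v(v_in, v_in, v_in)
        n_b = bottleneck.shape[1]
        # Averaging the bottleneck updates is the only cross-modal exchange.
        new_b = (a_out[:, -n_b:] + v_out[:, -n_b:]) / 2
        return a_out[:, :-n_b], v_out[:, :-n_b], new_b
```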
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Learning Multimodal VAEs through Mutual Supervision [72.77685889312889]
MEME combines information between modalities implicitly through mutual supervision.
We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes.
arXiv Detail & Related papers (2021-06-23T17:54:35Z)
- Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [55.28436972267793]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)
- Learning more expressive joint distributions in multimodal variational methods [0.17188280334580194]
We introduce a method that improves the representational capacity of multimodal variational methods using normalizing flows.
We demonstrate that the model improves on state-of-the-art multimodal methods based on variational inference on various computer vision tasks.
We also show that learning more powerful approximate joint distributions improves the quality of the generated samples.
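As an illustration of how a flow adds expressiveness, below is a minimal
planar-flow step with its closed-form log-det-Jacobian, which is the quantity
needed to correct the variational bound; this is a generic sketch, not the
paper's specific flow architecture.

```python
# One planar normalizing-flow step: f(z) = z + u * tanh(w^T z + b).
# Applying a chain of such steps to samples from a simple Gaussian posterior
# yields a more expressive distribution with a tractable log-density change.
import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        # z: (batch, dim). Invertibility needs u.w >= -1 (not enforced here).
        lin = z @ self.w + self.b                      # (batch,)
        f_z = z + self.u * torch.tanh(lin).unsqueeze(-1)
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1.0 + psi @ self.u) + 1e-8)
        return f_z, log_det
```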
arXiv Detail & Related papers (2020-09-08T11:45:27Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
- Towards Multimodal Response Generation with Exemplar Augmentation and Curriculum Optimization [73.45742420178196]
We propose a novel multimodal response generation framework with exemplar augmentation and curriculum optimization.
Our model achieves significant improvements compared to strong baselines in terms of diversity and relevance.
arXiv Detail & Related papers (2020-04-26T16:29:06Z)