On the Limitations of Multimodal VAEs
- URL: http://arxiv.org/abs/2110.04121v1
- Date: Fri, 8 Oct 2021 13:28:28 GMT
- Title: On the Limitations of Multimodal VAEs
- Authors: Imant Daunhawer, Thomas M. Sutter, Kieran Chin-Cheong, Emanuele
Palumbo and Julia E. Vogt
- Abstract summary: Multimodal variational autoencoders (VAEs) have shown promise as efficient generative models for weakly-supervised data.
Despite their advantage of weak supervision, they exhibit a gap in generative quality compared to unimodal VAEs.
- Score: 9.449650062296824
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multimodal variational autoencoders (VAEs) have shown promise as efficient
generative models for weakly-supervised data. Yet, despite their advantage of
weak supervision, they exhibit a gap in generative quality compared to unimodal
VAEs, which are completely unsupervised. In an attempt to explain this gap, we
uncover a fundamental limitation that applies to a large family of
mixture-based multimodal VAEs. We prove that the sub-sampling of modalities
enforces an undesirable upper bound on the multimodal ELBO and thereby limits
the generative quality of the respective models. Empirically, we showcase the
generative quality gap on both synthetic and real data and present the
tradeoffs between different variants of multimodal VAEs. We find that none of
the existing approaches fulfills all desired criteria of an effective
multimodal generative model when applied to more complex datasets than those
used in previous benchmarks. In summary, we identify, formalize, and validate
fundamental limitations of VAE-based approaches for modeling weakly-supervised
data and discuss implications for real-world applications.
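The abstract's core argument is that mixture-based multimodal VAEs average the ELBO over sub-sampled modality subsets, so each unimodal posterior must reconstruct every modality on its own. The toy sketch below illustrates that objective structure only; the linear-Gaussian encoders, unit-variance decoders, and all numbers are hypothetical choices for illustration, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mu, var):
    """Log-density of N(mu, var) evaluated at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def moe_elbo(x1, x2, n_samples=1000):
    """Monte-Carlo estimate of a mixture-of-experts-style multimodal ELBO.

    The objective averages over modality subsets {x1}, {x2} (the
    sub-sampling step): each unimodal posterior q_m(z | x_m) must
    reconstruct *both* modalities, which is what bounds the attainable
    likelihood in the paper's argument.
    """
    total = 0.0
    # Hypothetical linear-Gaussian encoders: mean = w*x + b, unit variance.
    for (w, b), x_enc in [((0.9, 0.0), x1), ((1.1, 0.1), x2)]:
        mu, var = w * x_enc + b, 1.0
        z = mu + np.sqrt(var) * rng.standard_normal(n_samples)
        # Decoders p(x_m | z) = N(z, 1): one z must explain both modalities.
        log_px = log_gauss(x1, z, 1.0) + log_gauss(x2, z, 1.0)
        kl = 0.5 * (var + mu ** 2 - 1.0 - np.log(var))  # KL(q_m || N(0, 1))
        total += log_px.mean() - kl
    return total / 2.0  # uniform mixture over the two modality subsets

elbo = moe_elbo(x1=0.5, x2=0.7)
print(f"MoE-style ELBO estimate: {elbo:.3f}")
```

Because every sub-sampled term conditions on a single modality, no amount of decoder capacity lets the averaged objective reach the joint-encoder optimum, which is the upper bound the paper formalizes.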
Related papers
- Rethinking Uncertainly Missing and Ambiguous Visual Modality in
Multi-Modal Entity Alignment [38.574204922793626]
We present a further analysis of visual modality incompleteness, benchmarking the latest MMEA models on the proposed MMEA-UMVM dataset.
Our research indicates that, in the face of modality incompleteness, models tend to overfit the modality noise and exhibit performance oscillations or declines at high modality-missing rates.
We introduce UMAEA, a robust multi-modal entity alignment approach designed to tackle uncertainly missing and ambiguous visual modalities.
arXiv Detail & Related papers (2023-07-30T12:16:49Z) - Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
arXiv Detail & Related papers (2023-06-07T14:16:44Z) - Provable Dynamic Fusion for Low-Quality Multimodal Data [94.39538027450948]
Dynamic multimodal fusion emerges as a promising learning paradigm.
Despite its widespread use, theoretical justifications in this field are still notably lacking.
This paper provides a theoretical understanding of dynamic multimodal fusion, analyzing one of the most popular fusion frameworks from the generalization perspective.
A novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which can improve the performance in terms of classification accuracy and model robustness.
arXiv Detail & Related papers (2023-06-03T08:32:35Z) - Score-Based Multimodal Autoencoders [4.594159253008448]
Multimodal Variational Autoencoders (VAEs) facilitate the construction of a tractable posterior within the latent space, given multiple modalities.
In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of unimodal VAEs.
Our model combines the superior generative quality of unimodal VAEs with coherent integration across different modalities.
arXiv Detail & Related papers (2023-05-25T04:43:47Z) - Quantifying & Modeling Multimodal Interactions: An Information
Decomposition Framework [89.8609061423685]
We propose an information-theoretic approach, based on partial information decomposition (PID), to quantify the degree of redundancy, uniqueness, and synergy relating input modalities to an output task.
To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks.
We demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) informing principled model selection, and (4) three real-world case studies.
arXiv Detail & Related papers (2023-02-23T18:59:05Z) - Exploiting modality-invariant feature for robust multimodal emotion
recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and consistently improves overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z) - Discriminative Multimodal Learning via Conditional Priors in Generative
Models [21.166519800652047]
This research studies the realistic scenario in which all modalities and class labels are available for model training.
We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities.
arXiv Detail & Related papers (2021-10-09T17:22:24Z) - Generalized Multimodal ELBO [11.602089225841631]
Multiple data types naturally co-occur when describing real-world phenomena and learning from them is a long-standing goal in machine learning research.
Existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models.
We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations.
arXiv Detail & Related papers (2021-05-06T07:05:00Z) - Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [55.28436972267793]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z) - Accounting for Unobserved Confounding in Domain Generalization [107.0464488046289]
This paper investigates the problem of learning robust, generalizable prediction models from a combination of datasets.
Part of the challenge of learning robust models lies in the influence of unobserved confounders.
We demonstrate the empirical performance of our approach on healthcare data from different modalities.
arXiv Detail & Related papers (2020-07-21T08:18:06Z) - Relating by Contrasting: A Data-efficient Framework for Multimodal
Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
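The contrastive framework above trains on the distinction between "related" and "unrelated" multimodal pairings. A minimal sketch of that idea, under assumptions of my own (random stand-in embeddings and an InfoNCE-style loss; not the paper's architecture), scores true pairs against mismatched ones:

```python
import numpy as np

rng = np.random.default_rng(1)

def info_nce(za, zb, temperature=0.1):
    """InfoNCE-style contrastive loss: paired rows of za/zb are the
    "related" positives; all cross pairings act as "unrelated" negatives."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature              # (n, n) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

# Stand-in embeddings for two modalities of the same underlying samples.
n, d = 8, 16
shared = rng.standard_normal((n, d))
za = shared + 0.1 * rng.standard_normal((n, d))   # modality-A view
zb = shared + 0.1 * rng.standard_normal((n, d))   # modality-B view

loss_related = info_nce(za, zb)                        # true pairings
loss_unrelated = info_nce(za, np.roll(zb, 1, axis=0))  # mismatched pairings
print(f"related: {loss_related:.3f}  unrelated: {loss_unrelated:.3f}")
```

The loss is low when the model pairs related samples and high otherwise, which is precisely the signal that lets such a framework exploit unpaired multimodal data.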
This list is automatically generated from the titles and abstracts of the papers on this site.