Score-Based Multimodal Autoencoders
- URL: http://arxiv.org/abs/2305.15708v1
- Date: Thu, 25 May 2023 04:43:47 GMT
- Title: Score-Based Multimodal Autoencoders
- Authors: Daniel Wesego and Amirmohammad Rooshenas
- Abstract summary: Multimodal Variational Autoencoders (VAEs) facilitate the construction of a tractable posterior within the latent space, given multiple modalities.
In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of unimodal VAEs.
Our model combines the superior generative quality of unimodal VAEs with coherent integration across different modalities.
- Score: 4.594159253008448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Variational Autoencoders (VAEs) represent a promising group of
generative models that facilitate the construction of a tractable posterior
within the latent space, given multiple modalities. Daunhawer et al. (2022)
demonstrate that as the number of modalities increases, the generative quality
of each modality declines. In this study, we explore an alternative approach to
enhance the generative performance of multimodal VAEs by jointly modeling the
latent space of unimodal VAEs using score-based models (SBMs). The role of the
SBM is to enforce multimodal coherence by learning the correlation among the
latent variables. Consequently, our model combines the superior generative
quality of unimodal VAEs with coherent integration across different modalities.
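The abstract does not spell out how the SBM is trained. As a rough illustration only, the sketch below concatenates per-modality latent codes into one joint vector and evaluates a denoising score matching loss on it, which is one common objective for score-based models. The function names (`joint_latent`, `dsm_loss`), the single noise level `sigma`, and the Gaussian toy latents are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def joint_latent(latents):
    """Concatenate per-modality latent codes (hypothetical helper).

    Each entry has shape (batch, dim_m); the result has shape
    (batch, sum of dim_m), one joint latent vector per sample.
    """
    return np.concatenate(latents, axis=1)


def dsm_loss(score_fn, z, sigma, rng):
    """Denoising score matching loss at a single noise level (assumed objective).

    z is perturbed with Gaussian noise of scale sigma, and the model score is
    regressed onto the score of the perturbation kernel, which is
    -(z_noisy - z) / sigma**2 = -noise / sigma.
    """
    noise = rng.standard_normal(z.shape)
    z_noisy = z + sigma * noise
    target = -noise / sigma
    pred = score_fn(z_noisy, sigma)
    # Multiplying the residual by sigma keeps the loss scale comparable
    # across noise levels (a sigma**2 weighting of the squared error).
    return float(np.mean(np.sum((sigma * (pred - target)) ** 2, axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for latents produced by two unimodal VAE encoders.
    z_image = rng.standard_normal((1000, 2))
    z_text = rng.standard_normal((1000, 2))
    z = joint_latent([z_image, z_text])

    # For standard normal latents the noisy marginal is N(0, (1 + sigma^2) I),
    # so its exact score is available in closed form for comparison.
    def exact_score(x, s):
        return -x / (1.0 + s ** 2)

    def zero_score(x, s):
        return np.zeros_like(x)

    print("exact score loss:", dsm_loss(exact_score, z, 1.0, rng))
    print("zero score loss:", dsm_loss(zero_score, z, 1.0, rng))
```

In a real model, `score_fn` would be a trained network over the joint latent, and the learned score is what would capture the correlation among the modalities' latent variables. The closed-form score here only serves to check that the loss prefers the correct score over a trivial one.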
Related papers
- Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains [114.76612918465948]
Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data.
We propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models.
arXiv Detail & Related papers (2025-01-10T04:35:46Z)
- Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video).
We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
- Multimodal ELBO with Diffusion Decoders [0.9208007322096533]
We propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model.
The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs.
Our model provides state-of-the-art results compared to other multimodal VAEs in different datasets with higher coherence and superior quality in the generated modalities.
arXiv Detail & Related papers (2024-08-29T20:12:01Z)
- A Markov Random Field Multi-Modal Variational AutoEncoder [1.2233362977312945]
This work introduces a novel multimodal VAE that incorporates a Markov Random Field (MRF) into both the prior and posterior distributions.
Our approach is specifically designed to model and leverage the intricacies of these relationships, enabling a more faithful representation of multimodal data.
arXiv Detail & Related papers (2024-08-18T19:27:30Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z)
- Learning multi-modal generative models with permutation-invariant encoders and tighter variational objectives [5.549794481031468]
Devising deep latent variable models for multi-modal data has been a long-standing theme in machine learning research.
In this work, we consider a variational objective that can tightly approximate the data log-likelihood.
We develop more flexible aggregation schemes that avoid the inductive biases in PoE or MoE approaches.
arXiv Detail & Related papers (2023-09-01T10:32:21Z)
- Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
arXiv Detail & Related papers (2023-06-07T14:16:44Z)
- On the Limitations of Multimodal VAEs [9.449650062296824]
Multimodal variational autoencoders (VAEs) have shown promise as efficient generative models for weakly-supervised data.
Despite their advantage of weak supervision, they exhibit a gap in generative quality compared to unimodal VAEs.
arXiv Detail & Related papers (2021-10-08T13:28:28Z)
- Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.