Multimodal ELBO with Diffusion Decoders
- URL: http://arxiv.org/abs/2408.16883v2
- Date: Mon, 03 Feb 2025 05:27:50 GMT
- Title: Multimodal ELBO with Diffusion Decoders
- Authors: Daniel Wesego, Pedram Rooshenas
- Abstract summary: We propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model.
The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs.
Our model provides state-of-the-art results compared to other multimodal VAEs on different datasets, with higher coherence and superior quality in the generated modalities.
- Score: 0.9208007322096533
- Abstract: Multimodal variational autoencoders have demonstrated their ability to learn the relationships between different modalities by mapping them into a latent representation. Their design and capacity to perform any-to-any conditional and unconditional generation make them appealing. However, different variants of multimodal VAEs often generate low-quality output, particularly when complex modalities such as images are involved. In addition, they frequently exhibit low coherence among the generated modalities when sampling from the joint distribution. To address these limitations, we propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model. The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs. The multimodal model can also seamlessly integrate with a standard feed-forward decoder for different types of modalities, facilitating end-to-end training and inference. Furthermore, we introduce an auxiliary score-based model to enhance the unconditional generation capabilities of our proposed approach. This approach addresses the limitations imposed by conventional multimodal VAEs and opens up new possibilities to improve multimodal generation tasks. Our model provides state-of-the-art results compared to other multimodal VAEs on different datasets, with higher coherence and superior quality in the generated modalities.
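For orientation, the following is a generic sketch of the kind of objective the abstract refers to, not the paper's exact formulation. The first line is the standard multimodal VAE ELBO for M modalities sharing one latent z. The second is the standard simplified denoising loss of a diffusion model conditioned on z, which in diffusion models serves as a reweighted bound on the negative log-likelihood and could therefore stand in for the reconstruction term of a complex modality. All symbols (M, x_m, z, q_\phi, p_\theta, \epsilon_\theta, \bar\alpha_t) are notation assumed here for illustration and do not come from this page.

```latex
% Generic multimodal ELBO: M modalities x_1, ..., x_M, shared latent z (assumed notation)
\mathcal{L}(x_{1:M}) =
  \mathbb{E}_{q_\phi(z \mid x_{1:M})}\Big[ \sum_{m=1}^{M} \log p_\theta(x_m \mid z) \Big]
  - \mathrm{KL}\big( q_\phi(z \mid x_{1:M}) \,\|\, p(z) \big)

% Simplified denoising loss of a diffusion decoder for modality x_m, conditioned on z
% (assumed form; \bar\alpha_t is the cumulative noise schedule at step t)
\mathcal{L}_{\mathrm{diff}}(x_m, z) =
  \mathbb{E}_{t,\, \epsilon \sim \mathcal{N}(0, I)}
  \Big[ \big\| \epsilon - \epsilon_\theta\big( \sqrt{\bar\alpha_t}\, x_m + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t,\; z \big) \big\|^2 \Big]
```

Under this reading, simpler modalities would keep the feed-forward term \log p_\theta(x_m \mid z), while image-like modalities would use the denoising loss instead; how the paper weights and combines these terms is not stated on this page.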
Related papers
- MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks [50.98856172702256]
We propose the Modality-INformed knowledge Distillation (MIND) framework, a multimodal model compression approach.
MIND transfers knowledge from ensembles of pre-trained deep neural networks of varying sizes into a smaller multimodal student.
We evaluate MIND on binary and multilabel clinical prediction tasks using time series data and chest X-ray images.
arXiv Detail & Related papers (2025-02-03T08:50:00Z)
- Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video).
We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
- Learning Multimodal Latent Generative Models with Energy-Based Prior [3.6648642834198797]
We propose a novel framework that integrates the latent generative model with the EBM.
This approach results in a more expressive and informative prior that better captures information across multiple modalities.
arXiv Detail & Related papers (2024-09-30T01:38:26Z)
- A Markov Random Field Multi-Modal Variational AutoEncoder [1.2233362977312945]
This work introduces a novel multimodal VAE that incorporates a Markov Random Field (MRF) into both the prior and posterior distributions.
Our approach is specifically designed to model and leverage the intricacies of these relationships, enabling a more faithful representation of multimodal data.
arXiv Detail & Related papers (2024-08-18T19:27:30Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
arXiv Detail & Related papers (2023-06-07T14:16:44Z)
- Provable Dynamic Fusion for Low-Quality Multimodal Data [94.39538027450948]
Dynamic multimodal fusion emerges as a promising learning paradigm.
Despite its widespread use, theoretical justifications in this field are still notably lacking.
This paper provides a theoretical understanding to answer this question, from the generalization perspective, under one of the most popular multimodal fusion frameworks.
A novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which can improve the performance in terms of classification accuracy and model robustness.
arXiv Detail & Related papers (2023-06-03T08:32:35Z)
- Score-Based Multimodal Autoencoder [0.9208007322096533]
Multimodal Variational Autoencoders (VAEs) facilitate the construction of a tractable posterior within the latent space given multiple modalities.
Previous studies have shown that as the number of modalities increases, the generative quality of each modality declines.
In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of independently trained unimodal VAEs.
arXiv Detail & Related papers (2023-05-25T04:43:47Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- On the Limitations of Multimodal VAEs [9.449650062296824]
Multimodal variational autoencoders (VAEs) have shown promise as efficient generative models for weakly-supervised data.
Despite their advantage of weak supervision, they exhibit a gap in generative quality compared to unimodal VAEs.
arXiv Detail & Related papers (2021-10-08T13:28:28Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.