Multi-modal Latent Diffusion
- URL: http://arxiv.org/abs/2306.04445v2
- Date: Mon, 18 Dec 2023 10:43:55 GMT
- Title: Multi-modal Latent Diffusion
- Authors: Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
- Abstract summary: Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
- Score: 8.316365279740188
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal data-sets are ubiquitous in modern applications, and multi-modal
Variational Autoencoders are a popular family of models that aim to learn a
joint representation of the different modalities. However, existing approaches
suffer from a coherence-quality tradeoff, where models with good generation
quality lack generative coherence across modalities, and vice versa. We discuss
the limitations underlying the unsatisfactory performance of existing methods,
to motivate the need for a different approach. We propose a novel method that
uses a set of independently trained, uni-modal, deterministic autoencoders.
Individual latent variables are concatenated into a common latent space, which
is fed to a masked diffusion model to enable generative modeling. We also
introduce a new multi-time training method to learn the conditional score
network for multi-modal diffusion. Our methodology substantially outperforms
competitors in both generation quality and coherence, as shown through an
extensive experimental campaign.
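The sketch below illustrates the pipeline described in the abstract, under illustrative assumptions (two modalities, MLP networks, a toy cosine noise schedule, and made-up dimensions; this is not the authors' code). Each modality has its own frozen, deterministic autoencoder; the latents are concatenated, and a noise-prediction network is trained on the joint latent with per-modality masking and per-modality diffusion times, so that any subset of clean modalities can condition the denoising of the rest.
```python
# Minimal sketch of multi-modal latent diffusion (illustrative, not the authors' code).
# Assumptions: two modalities, pre-trained deterministic autoencoders are frozen,
# and a simple MLP plays the role of the (conditional) score network.
import torch
import torch.nn as nn

class UniModalAE(nn.Module):
    """Independently trained, deterministic autoencoder for one modality."""
    def __init__(self, x_dim, z_dim):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

class JointScoreNet(nn.Module):
    """Noise predictor on the concatenated latent; takes one diffusion time per modality."""
    def __init__(self, z_dims):
        super().__init__()
        d = sum(z_dims) + len(z_dims)          # latents + one time scalar per modality
        self.net = nn.Sequential(nn.Linear(d, 256), nn.SiLU(), nn.Linear(256, sum(z_dims)))
    def forward(self, z, times):
        return self.net(torch.cat([z, times], dim=-1))

def training_step(score_net, latents, T=1000):
    """One masked, multi-time denoising step: modalities used as conditioning are kept
    clean (time 0); the remaining modalities are noised, each with its own time."""
    B = latents[0].shape[0]
    mask = torch.rand(B, len(latents)) < 0.5         # which modalities to noise
    mask[:, 0] |= ~mask.any(dim=1)                    # noise at least one modality
    noisy, target, times = [], [], []
    for m, z in enumerate(latents):
        t = torch.randint(1, T, (B, 1)) * mask[:, m:m+1]   # time 0 => conditioning modality
        a = torch.cos(0.5 * torch.pi * t / T) ** 2          # toy cosine schedule
        eps = torch.randn_like(z)
        z_t = torch.where(mask[:, m:m+1], a.sqrt() * z + (1 - a).sqrt() * eps, z)
        noisy.append(z_t); target.append(eps * mask[:, m:m+1]); times.append(t / T)
    pred = score_net(torch.cat(noisy, -1), torch.cat(times, -1))
    return ((pred - torch.cat(target, -1)) ** 2).mean()

# Usage: frozen uni-modal encoders produce the latents, only the score net is trained.
ae_img, ae_txt = UniModalAE(784, 16), UniModalAE(300, 8)
score_net = JointScoreNet([16, 8])
x_img, x_txt = torch.randn(32, 784), torch.randn(32, 300)
with torch.no_grad():
    z = [ae_img.enc(x_img), ae_txt.enc(x_txt)]
loss = training_step(score_net, z)
loss.backward()
```
At sampling time, one would run the (conditional) reverse diffusion in the joint latent space and decode each latent slice with its own uni-modal decoder.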
Related papers
- Learning Multimodal Latent Generative Models with Energy-Based Prior [3.6648642834198797]
We propose a novel framework that integrates the latent generative model with an energy-based model (EBM).
This approach results in a more expressive and informative prior that better captures information across multiple modalities.
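A minimal sketch of a generic energy-based prior over a shared multimodal latent, assuming an MLP energy function trained with a contrastive-divergence-style objective and Langevin sampling; the exact formulation in the paper may differ.
```python
# Illustrative sketch of an energy-based prior over a shared multimodal latent
# (a generic EBM-prior construction, not the paper's exact formulation).
import torch
import torch.nn as nn

class LatentEnergy(nn.Module):
    """E_theta(z): low energy for plausible joint latents."""
    def __init__(self, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 128), nn.SiLU(), nn.Linear(128, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

def langevin_sample(energy, n, z_dim, steps=30, step_size=0.05):
    """Draw prior samples by Langevin dynamics from a Gaussian initialization."""
    z = torch.randn(n, z_dim)
    for _ in range(steps):
        z = z.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(z).sum(), z)[0]
        z = z - 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(z)
    return z.detach()

# Contrastive-divergence-style update: push energy down on data (posterior) latents,
# up on prior samples.
energy = LatentEnergy(z_dim=16)
opt = torch.optim.Adam(energy.parameters(), lr=1e-4)
z_post = torch.randn(64, 16)            # stand-in for latents inferred from real data
z_prior = langevin_sample(energy, 64, 16)
loss = energy(z_post).mean() - energy(z_prior).mean()
opt.zero_grad(); loss.backward(); opt.step()
```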
arXiv Detail & Related papers (2024-09-30T01:38:26Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
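A generic sketch of symmetric, scale-wise fusion of two modality feature pyramids (e.g., RGB and depth); the module names, 1x1 projections, and averaging rule are illustrative assumptions rather than the U3M architecture.
```python
# Generic multi-scale fusion sketch for two modalities (e.g., RGB and depth);
# module names and the symmetric-average fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        # One 1x1 projection per modality and per scale, so neither modality is privileged.
        self.proj_a = nn.ModuleList(nn.Conv2d(c, c, 1) for c in channels)
        self.proj_b = nn.ModuleList(nn.Conv2d(c, c, 1) for c in channels)

    def forward(self, feats_a, feats_b):
        """feats_a / feats_b: lists of per-scale feature maps from two modality backbones."""
        fused = []
        for pa, pb, fa, fb in zip(self.proj_a, self.proj_b, feats_a, feats_b):
            fused.append(0.5 * (pa(fa) + pb(fb)))   # symmetric, scale-wise fusion
        return fused

# Toy usage with three pyramid levels of decreasing resolution.
rgb = [torch.randn(2, 64, 64, 64), torch.randn(2, 128, 32, 32), torch.randn(2, 256, 16, 16)]
dep = [torch.randn(2, 64, 64, 64), torch.randn(2, 128, 32, 32), torch.randn(2, 256, 16, 16)]
out = MultiScaleFusion()(rgb, dep)
print([o.shape for o in out])
```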
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
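A toy sketch of the alternating scheme, assuming one encoder per modality and a single shared linear head; each optimization step is purely uni-modal, while the shared head is updated on every step.
```python
# Sketch of alternating unimodal training with a shared head (an illustrative reading of MLA,
# not the authors' implementation).
import torch
import torch.nn as nn

encoders = nn.ModuleDict({
    "image": nn.Sequential(nn.Linear(784, 64), nn.ReLU()),
    "audio": nn.Sequential(nn.Linear(128, 64), nn.ReLU()),
})
shared_head = nn.Linear(64, 10)                  # continuously optimized across modalities
opt = torch.optim.SGD(list(encoders.parameters()) + list(shared_head.parameters()), lr=1e-2)
criterion = nn.CrossEntropyLoss()

batches = {                                      # toy mini-batches, one per modality
    "image": (torch.randn(32, 784), torch.randint(0, 10, (32,))),
    "audio": (torch.randn(32, 128), torch.randint(0, 10, (32,))),
}

for epoch in range(2):
    for name in encoders:                        # alternate: one unimodal step at a time
        x, y = batches[name]
        logits = shared_head(encoders[name](x))
        loss = criterion(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
```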
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
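A minimal sketch of early fusion in a single stream, assuming three skeleton-derived input streams projected to a common width, concatenated along the token dimension, and encoded by one transformer; dimensions and layout are illustrative.
```python
# Sketch of an early-fusion, single-stream encoder for several skeleton modalities
# (e.g., joint / bone / motion streams); dimensions and token layout are illustrative.
import torch
import torch.nn as nn

class SingleStreamEncoder(nn.Module):
    def __init__(self, in_dims=(150, 150, 150), d_model=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in in_dims)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, streams):
        """streams: list of (batch, frames, in_dim) tensors, one per modality."""
        tokens = torch.cat([p(s) for p, s in zip(self.proj, streams)], dim=1)  # early fusion
        return self.encoder(tokens).mean(dim=1)   # one joint sequence, one encoder pass

# Toy usage: three modalities, 20 frames each, fused into a single 60-token sequence.
enc = SingleStreamEncoder()
feats = enc([torch.randn(4, 20, 150) for _ in range(3)])
print(feats.shape)    # torch.Size([4, 128])
```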
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Provable Dynamic Fusion for Low-Quality Multimodal Data [94.39538027450948]
Dynamic multimodal fusion emerges as a promising learning paradigm.
Despite its widespread use, theoretical justifications in this field are still notably lacking.
This paper provides a theoretical understanding of when dynamic fusion helps, under one of the most popular multimodal fusion frameworks, from a generalization perspective.
A novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF) is proposed, which can improve the performance in terms of classification accuracy and model robustness.
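A generic quality-aware late-fusion sketch: each modality branch produces logits, and a per-sample confidence proxy (here, the logsumexp of the logits) weights its contribution; the exact weighting rule in QMF may differ.
```python
# Generic quality-aware late-fusion sketch: each modality contributes logits weighted by an
# estimated per-sample confidence (an illustrative stand-in for the QMF weighting rule).
import torch
import torch.nn as nn

class QualityAwareFusion(nn.Module):
    def __init__(self, dims=(784, 128), num_classes=10):
        super().__init__()
        self.branches = nn.ModuleList(nn.Linear(d, num_classes) for d in dims)

    def forward(self, inputs):
        logits = [b(x) for b, x in zip(self.branches, inputs)]
        # Use the logsumexp of each branch's logits (negative energy) as a quality proxy.
        quality = torch.stack([l.logsumexp(dim=-1) for l in logits], dim=-1)
        w = quality.softmax(dim=-1)                            # (batch, n_modalities)
        return sum(w[:, i:i+1] * logits[i] for i in range(len(logits)))

fusion = QualityAwareFusion()
out = fusion([torch.randn(8, 784), torch.randn(8, 128)])
print(out.shape)   # torch.Size([8, 10])
```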
arXiv Detail & Related papers (2023-06-03T08:32:35Z)
- Score-Based Multimodal Autoencoders [4.594159253008448]
Multimodal Variational Autoencoders (VAEs) facilitate the construction of a tractable posterior within the latent space, given multiple modalities.
In this study, we explore an alternative approach to enhance the generative performance of multimodal VAEs by jointly modeling the latent space of unimodal VAEs.
Our model combines the superior generative quality of unimodal VAEs with coherent integration across different modalities.
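A compact sketch of fitting a score model to the concatenated latents of frozen unimodal VAEs, using single-noise-level denoising score matching; purely illustrative.
```python
# Compact sketch: a score model fitted to the concatenated latents of frozen unimodal VAEs
# (denoising score matching at a single noise level; purely illustrative).
import torch
import torch.nn as nn

score = nn.Sequential(nn.Linear(24, 128), nn.SiLU(), nn.Linear(128, 24))
opt = torch.optim.Adam(score.parameters(), lr=1e-3)
sigma = 0.1

z_img, z_txt = torch.randn(64, 16), torch.randn(64, 8)   # stand-ins for frozen VAE latents
z = torch.cat([z_img, z_txt], dim=-1)
noise = torch.randn_like(z) * sigma
target = -noise / sigma ** 2                              # score of the Gaussian perturbation
loss = ((score(z + noise) - target) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```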
arXiv Detail & Related papers (2023-05-25T04:43:47Z)
- Collaborative Diffusion for Multi-Modal Face Generation and Editing [34.16906110777047]
We present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training.
Specifically, we propose dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model.
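A sketch of the dynamic-diffuser idea under simplifying assumptions: a small meta-network predicts per-pixel influence maps (normalized with a softmax) that blend the noise predictions of frozen denoisers, which here are stand-in conv layers rather than real pre-trained diffusion models.
```python
# Sketch of a "dynamic diffuser"-style combiner: a meta-network predicts spatial influence
# maps that blend the noise predictions of frozen, pre-trained denoisers.
# The denoisers below are stand-in conv nets, not real pre-trained diffusion models.
import torch
import torch.nn as nn

class InfluenceNet(nn.Module):
    """Predicts one influence map per collaborator, conditioned on x_t and the timestep."""
    def __init__(self, n_models=2, channels=3):
        super().__init__()
        self.net = nn.Conv2d(channels + 1, n_models, kernel_size=3, padding=1)
    def forward(self, x_t, t):
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1)).softmax(dim=1)  # sums to 1 per pixel

denoiser_a = nn.Conv2d(3, 3, 3, padding=1)    # stand-in for a frozen uni-modal denoiser
denoiser_b = nn.Conv2d(3, 3, 3, padding=1)
influence = InfluenceNet()

x_t = torch.randn(2, 3, 32, 32)
t = torch.full((2,), 0.5)
w = influence(x_t, t)                          # (batch, 2, H, W)
eps = w[:, 0:1] * denoiser_a(x_t) + w[:, 1:2] * denoiser_b(x_t)
print(eps.shape)                               # torch.Size([2, 3, 32, 32])
```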
arXiv Detail & Related papers (2023-04-20T17:59:02Z)
- Generalizing Multimodal Variational Methods to Sets [35.69942798534849]
This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space.
By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization.
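A sketch of modeling the joint-modality posterior directly: per-modality encodings are pooled with a permutation-invariant mean and mapped to a single Gaussian over the shared latent, so any subset of modalities can be encoded; names and dimensions are assumptions.
```python
# Sketch of a set-style joint posterior: per-modality encodings are pooled permutation-
# invariantly and mapped to one Gaussian over the shared latent (an illustrative reading
# of "modeling the joint-modality posterior directly").
import torch
import torch.nn as nn

class SetJointPosterior(nn.Module):
    def __init__(self, in_dims=(784, 128), h=64, z_dim=16):
        super().__init__()
        self.enc = nn.ModuleList(nn.Linear(d, h) for d in in_dims)
        self.to_mu = nn.Linear(h, z_dim)
        self.to_logvar = nn.Linear(h, z_dim)

    def forward(self, inputs):
        """inputs: dict {modality_index: tensor}; missing modalities are simply absent."""
        embeds = [self.enc[i](x) for i, x in inputs.items()]
        pooled = torch.stack(embeds, dim=0).mean(dim=0)     # permutation-invariant pooling
        return self.to_mu(pooled), self.to_logvar(pooled)

q = SetJointPosterior()
mu, logvar = q({0: torch.randn(4, 784), 1: torch.randn(4, 128)})
mu_partial, _ = q({1: torch.randn(4, 128)})                 # works with a modality subset
```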
arXiv Detail & Related papers (2022-12-19T23:50:19Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
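A toy construction of a single transition matrix over the concatenated text and image token vocabularies plus one shared absorbing [MASK] token; the absorbing-state formulation is used here only for simplicity and need not match the paper's matrix.
```python
# Toy construction of a unified transition matrix for discrete diffusion over a joint
# vocabulary (text tokens + image tokens + one shared absorbing [MASK] token).
import torch

def unified_transition_matrix(text_vocab, image_vocab, beta):
    """Q[i, j] = probability of moving from token i to token j in one diffusion step."""
    V = text_vocab + image_vocab + 1          # shared vocabulary, last index = [MASK]
    mask_id = V - 1
    Q = torch.eye(V) * (1.0 - beta)           # stay put with probability 1 - beta
    Q[:, mask_id] += beta                     # otherwise jump to the absorbing [MASK]
    Q[mask_id] = 0.0
    Q[mask_id, mask_id] = 1.0                 # [MASK] is absorbing
    return Q

Q = unified_transition_matrix(text_vocab=100, image_vocab=512, beta=0.02)
assert torch.allclose(Q.sum(dim=1), torch.ones(Q.shape[0]))  # each row is a distribution
# One noising step applied to a one-hot token: x_t ~ Categorical(x_{t-1}^T Q).
x = torch.zeros(Q.shape[0]); x[3] = 1.0
probs = x @ Q
```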
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- A survey of multimodal deep generative models [20.717591403306287]
Multimodal learning is a framework for building models that make predictions based on different types of modalities.
Deep generative models in which distributions are parameterized by deep neural networks have attracted much attention.
arXiv Detail & Related papers (2022-07-05T15:48:51Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under the proposed framework, the generative model can accurately distinguish related samples from unrelated ones, making it possible to exploit plentiful unlabeled, unpaired multimodal data.
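A generic cross-modal contrastive loss in this spirit: matched (related) pairs sit on the diagonal of the similarity matrix, while all other pairings in the batch act as unrelated negatives; this InfoNCE-style form is an illustrative stand-in for the paper's objective.
```python
# Generic cross-modal contrastive loss: related (paired) samples are pulled together,
# unrelated pairings in the batch act as negatives (InfoNCE-style, illustrative only).
import torch
import torch.nn.functional as F

def cross_modal_contrastive(z_a, z_b, temperature=0.1):
    """z_a, z_b: (batch, dim) embeddings where row i of each comes from the same example."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # similarity of every a-b pairing
    labels = torch.arange(z_a.shape[0])           # the matching (related) pair is the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = cross_modal_contrastive(torch.randn(16, 64), torch.randn(16, 64))
```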
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.