Efficient Multimodal Diffusion Models Using Joint Data Infilling with
Partially Shared U-Net
- URL: http://arxiv.org/abs/2311.16488v1
- Date: Tue, 28 Nov 2023 04:34:44 GMT
- Title: Efficient Multimodal Diffusion Models Using Joint Data Infilling with
Partially Shared U-Net
- Authors: Zizhao Hu, Shaochong Jia, Mohammad Rostami
- Abstract summary: Partially Shared U-Net (PS-U-Net) is an efficient multimodal diffusion model that allows text and image inputs to pass through dedicated layers and skip-connections for preserving modality-specific fine-grained details.
Inspired by image inpainting, we also propose a new efficient multimodal sampling method that introduces new scenarios for conditional generation while only requiring a simple joint distribution to be learned.
Our empirical exploration of the MS-COCO dataset demonstrates that our method generates multimodal text and image data with higher quality compared to existing multimodal diffusion models.
- Score: 20.437172251393257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, diffusion models have been used successfully to fit distributions
for cross-modal data translation and multimodal data generation. However, these
methods rely on extensive scaling, overlooking the inefficiency and
interference between modalities. We develop the Partially Shared U-Net (PS-U-Net)
architecture, an efficient multimodal diffusion model that allows text
and image inputs to pass through dedicated layers and skip-connections for
preserving modality-specific fine-grained details. Inspired by image
inpainting, we also propose a new efficient multimodal sampling method that
introduces new scenarios for conditional generation while only requiring a
simple joint distribution to be learned. Our empirical exploration of the
MS-COCO dataset demonstrates that our method generates multimodal text and
image data with higher quality compared to existing multimodal diffusion models
while having a comparable size, faster training, faster multimodal sampling,
and more flexible generation.
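The architectural idea described in the abstract can be illustrated with a minimal sketch: each modality keeps its own encoder, decoder, and skip connection, and only a small bottleneck is shared across modalities. The layer choices, text representation, and timestep conditioning below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a "partially shared" denoiser: dedicated per-modality
# encoders/decoders and skip connections, with a small shared bottleneck.
import torch
import torch.nn as nn


class PartiallySharedUNet(nn.Module):
    def __init__(self, img_ch=3, txt_dim=512, hidden=256):
        super().__init__()
        # Image branch: dedicated encoder/decoder with its own skip connection.
        self.img_enc = nn.Sequential(
            nn.Conv2d(img_ch, hidden, 3, stride=2, padding=1), nn.SiLU())
        self.img_dec = nn.Sequential(
            nn.ConvTranspose2d(2 * hidden, hidden, 4, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, img_ch, 3, padding=1))
        # Text branch: dedicated encoder/decoder operating on token embeddings.
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, hidden), nn.SiLU())
        self.txt_dec = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.SiLU(), nn.Linear(hidden, txt_dim))
        # Shared bottleneck: the only layers both modalities pass through.
        self.shared = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2)
        self.time_emb = nn.Embedding(1000, hidden)  # diffusion timestep embedding

    def forward(self, x_img, x_txt, t):
        # x_img: (B, C, H, W) noisy image; x_txt: (B, L, D) noisy text embeddings; t: (B,) int timesteps.
        h_img = self.img_enc(x_img)                      # (B, hidden, H/2, W/2)
        h_txt = self.txt_enc(x_txt)                      # (B, L, hidden)
        B, C, Hh, Wh = h_img.shape
        tokens = torch.cat([h_img.flatten(2).transpose(1, 2), h_txt], dim=1)
        tokens = tokens + self.time_emb(t)[:, None, :]   # broadcast timestep over all tokens
        mixed = self.shared(tokens)                      # joint bottleneck
        s_img = mixed[:, : Hh * Wh].transpose(1, 2).reshape(B, C, Hh, Wh)
        s_txt = mixed[:, Hh * Wh:]
        # Modality-specific skip connections: encoder features bypass the shared bottleneck.
        eps_img = self.img_dec(torch.cat([h_img, s_img], dim=1))
        eps_txt = self.txt_dec(torch.cat([h_txt, s_txt], dim=-1))
        return eps_img, eps_txt                          # per-modality noise predictions
```

In this sketch the shared transformer bottleneck is the only place where image and text tokens interact, while the per-modality skips carry fine-grained details straight from each encoder to its decoder.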
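The inpainting-inspired conditional sampling can be sketched in a similar hedged way: re-noise the observed modality to the current timestep and clamp it at every reverse step, so a jointly trained model only has to denoise the missing modality. The linear noise schedule, the DDPM update rule, and the `model(x_img, x_txt, t)` signature are generic assumptions, not the paper's exact joint-infilling procedure.

```python
# Sketch of inpainting-style conditional sampling from a joint denoiser,
# assuming standard DDPM notation and a model that returns per-modality noise.
import torch


@torch.no_grad()
def sample_image_given_text(model, txt_0, img_shape, T=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, T, device=device)   # linear schedule (assumption)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x_img = torch.randn(img_shape, device=device)           # unknown modality starts as pure noise
    for t in reversed(range(T)):
        tt = torch.full((img_shape[0],), t, device=device, dtype=torch.long)
        # Clamp the observed modality: diffuse the clean text embeddings to noise level t.
        x_txt = alpha_bar[t].sqrt() * txt_0 + (1 - alpha_bar[t]).sqrt() * torch.randn_like(txt_0)
        eps_img, _ = model(x_img, x_txt, tt)                 # joint model; text prediction is ignored
        # Standard DDPM posterior mean, applied to the image modality only.
        mean = (x_img - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_img) / alphas[t].sqrt()
        x_img = mean + betas[t].sqrt() * torch.randn_like(x_img) if t > 0 else mean
    return x_img
```

Swapping the roles of the two modalities gives image-conditioned text generation, and dropping the clamping step recovers unconditional joint sampling, which is how a single learned joint distribution can cover several conditional-generation scenarios.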
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - HMDN: Hierarchical Multi-Distribution Network for Click-Through Rate Prediction [26.32695178700689]
We propose a flexible modeling paradigm named Hierarchical Multi-Distribution Network (HMDN).
HMDN efficiently models mixed multi-distributions and can seamlessly integrate with existing multi-distribution methods.
Experimental results on both public and industrial datasets validate the effectiveness and flexibility of HMDN.
arXiv Detail & Related papers (2024-08-02T15:29:59Z) - Diffusion Models For Multi-Modal Generative Modeling [32.61765315067488]
We propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space.
We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling.
arXiv Detail & Related papers (2024-07-24T18:04:17Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
arXiv Detail & Related papers (2023-06-07T14:16:44Z) - Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models [54.1843419649895]
We propose a solution based on denoising diffusion probabilistic models (DDPMs).
Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models.
Our method can unite multiple diffusion models trained on multiple sub-tasks and conquer the combined task.
arXiv Detail & Related papers (2022-12-01T18:59:55Z) - Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z) - Versatile Diffusion: Text, Images and Variations All in One Diffusion Model [76.89932822375208]
Versatile Diffusion handles multiple flows of text-to-image, image-to-text, and variations in one unified model.
Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
arXiv Detail & Related papers (2022-11-15T17:44:05Z) - Learning more expressive joint distributions in multimodal variational methods [0.17188280334580194]
We introduce a method that improves the representational capacity of multimodal variational methods using normalizing flows.
We demonstrate that the model improves on state-of-the-art multimodal methods based on variational inference on various computer vision tasks.
We also show that learning more powerful approximate joint distributions improves the quality of the generated samples.
arXiv Detail & Related papers (2020-09-08T11:45:27Z) - Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.