Collaborative Diffusion for Multi-Modal Face Generation and Editing
- URL: http://arxiv.org/abs/2304.10530v1
- Date: Thu, 20 Apr 2023 17:59:02 GMT
- Title: Collaborative Diffusion for Multi-Modal Face Generation and Editing
- Authors: Ziqi Huang, Kelvin C.K. Chan, Yuming Jiang, Ziwei Liu
- Abstract summary: We present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training.
Specifically, we propose dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model.
- Score: 34.16906110777047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have recently emerged as a powerful generative tool. Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Our key insight is that diffusion models driven by different modalities are inherently complementary regarding the latent denoising steps, upon which bilateral connections can be established. Specifically, we propose the dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model. Collaborative Diffusion not only combines the generation capabilities of uni-modal diffusion models, but also integrates multiple uni-modal manipulations to perform multi-modal editing. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in both image quality and condition consistency.
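To make the abstract's mechanism concrete, the following is a minimal sketch (not the authors' implementation) of how an influence-weighted collaboration between frozen uni-modal denoisers could look at one reverse-diffusion step. All names (`dynamic_diffuser`, `collaborative_denoise_step`) and the random placeholder influence maps are illustrative assumptions; in the paper the dynamic diffuser is a learned meta-network, and the uni-modal models are pre-trained text- and mask-conditioned diffusion models.

```python
# Sketch of influence-weighted collaboration between frozen uni-modal
# denoisers, assuming each exposes a noise-prediction function eps(x_t, t).
import numpy as np

def dynamic_diffuser(x_t: np.ndarray, t: int, n_models: int) -> np.ndarray:
    """Placeholder for the meta-network: returns one spatial influence map per
    uni-modal model, normalized across models at every pixel. In the paper,
    these maps are predicted from (x_t, t); here they are random stand-ins."""
    h, w = x_t.shape[-2:]
    logits = np.random.randn(n_models, 1, h, w)
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return weights  # shape: (n_models, 1, H, W)

def collaborative_denoise_step(x_t, t, unimodal_eps_fns):
    """Blend the noise predictions of the frozen uni-modal models using the
    predicted influence functions; a standard DDPM/DDIM update would then
    use the blended prediction to compute x_{t-1} (omitted for brevity)."""
    influences = dynamic_diffuser(x_t, t, len(unimodal_eps_fns))
    eps = sum(w * fn(x_t, t) for w, fn in zip(influences, unimodal_eps_fns))
    return eps

# Usage with two stand-in denoisers (real ones would be conditioned on a
# text prompt and a segmentation mask, respectively).
x_t = np.random.randn(3, 64, 64)
text_eps = lambda x, t: np.zeros_like(x)   # hypothetical text-driven model
mask_eps = lambda x, t: np.zeros_like(x)   # hypothetical mask-driven model
eps = collaborative_denoise_step(x_t, t=500, unimodal_eps_fns=[text_eps, mask_eps])
```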
Related papers
- Dual Diffusion for Unified Image Generation and Understanding [32.7554623473768]
We propose a large-scale and fully end-to-end diffusion model for multi-modal understanding and generation.
We leverage a cross-modal maximum likelihood estimation framework that simultaneously trains the conditional likelihoods of both images and text jointly.
Our model attained competitive performance compared to recent unified image understanding and generation models.
arXiv Detail & Related papers (2024-12-31T05:49:00Z)
- Multimodal ELBO with Diffusion Decoders [0.9208007322096533]
We propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model.
The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs.
Our model provides state-of-the-art results compared to other multimodal VAEs in different datasets with higher coherence and superior quality in the generated modalities.
arXiv Detail & Related papers (2024-08-29T20:12:01Z)
- Diffusion Models For Multi-Modal Generative Modeling [32.61765315067488]
We propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space.
We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling.
arXiv Detail & Related papers (2024-07-24T18:04:17Z)
- InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion [53.90516061351706]
We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction.
For sampling, we combine anti-penetration and synthesis-free guidance to enable plausible generation.
Our method significantly outperforms baseline generative models in terms of plausibility and diversity.
arXiv Detail & Related papers (2024-03-26T06:35:55Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
arXiv Detail & Related papers (2023-06-07T14:16:44Z)
- Diffusion Glancing Transformer for Parallel Sequence to Sequence Learning [52.72369034247396]
We propose the diffusion glancing transformer, which employs a modality diffusion process and residual glancing sampling.
DIFFGLAT achieves better generation accuracy while maintaining fast decoding speed compared with both autoregressive and non-autoregressive models.
arXiv Detail & Related papers (2022-12-20T13:36:25Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model [76.89932822375208]
Versatile Diffusion handles multiple flows of text-to-image, image-to-text, and variations in one unified model.
Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
arXiv Detail & Related papers (2022-11-15T17:44:05Z)
- Image Generation with Multimodal Priors using Denoising Diffusion Probabilistic Models [54.1843419649895]
A major challenge in using generative models to accomplish this task is the lack of paired data containing all modalities and corresponding outputs.
We propose a solution based on denoising diffusion probabilistic models to generate images under multimodal priors.
arXiv Detail & Related papers (2022-06-10T12:23:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.