Collaborative Diffusion for Multi-Modal Face Generation and Editing
- URL: http://arxiv.org/abs/2304.10530v1
- Date: Thu, 20 Apr 2023 17:59:02 GMT
- Title: Collaborative Diffusion for Multi-Modal Face Generation and Editing
- Authors: Ziqi Huang, Kelvin C.K. Chan, Yuming Jiang, Ziwei Liu
- Abstract summary: We present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training.
Specifically, we propose dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model.
- Score: 34.16906110777047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have recently emerged as a powerful generative tool. Despite the
great progress, existing diffusion models mainly focus on uni-modal control,
i.e., the diffusion process is driven by only one modality of condition. To
further unleash the users' creativity, it is desirable for the model to be
controllable by multiple modalities simultaneously, e.g., generating and
editing faces by describing the age (text-driven) while drawing the face shape
(mask-driven). In this work, we present Collaborative Diffusion, where
pre-trained uni-modal diffusion models collaborate to achieve multi-modal face
generation and editing without re-training. Our key insight is that diffusion
models driven by different modalities are inherently complementary regarding
the latent denoising steps, upon which bilateral connections can be
established. Specifically, we propose dynamic diffuser, a meta-network that adaptively
hallucinates multi-modal denoising steps by predicting the spatial-temporal
influence functions for each pre-trained uni-modal model. Collaborative
Diffusion not only combines the generation capabilities of uni-modal
diffusion models, but also integrates multiple uni-modal manipulations to
perform multi-modal editing. Extensive qualitative and quantitative experiments
demonstrate the superiority of our framework in both image quality and
condition consistency.
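The dynamic-diffuser idea described above can be sketched as follows: at each denoising step, every pre-trained uni-modal model proposes a noise estimate, and a meta-network assigns each model a spatial influence map; the maps are normalized per pixel so the contributions sum to one, and their weighted sum gives the multi-modal denoising step. This is a minimal illustrative sketch, not the paper's actual API; all function names and shapes here are assumptions.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def collaborative_step(noise_preds, influence_logits):
    """Fuse uni-modal noise predictions with per-pixel influence weights.

    noise_preds:      (M, H, W) array, one noise estimate per uni-modal model
    influence_logits: (M, H, W) array, meta-network outputs (pre-softmax)
    returns:          (H, W) fused noise estimate
    """
    # Normalize across the M models so each pixel's weights sum to one.
    weights = softmax(influence_logits, axis=0)
    return (weights * noise_preds).sum(axis=0)

# Toy usage: two "models" (e.g. text-driven and mask-driven) on a 4x4 latent.
rng = np.random.default_rng(0)
eps = rng.normal(size=(2, 4, 4))      # stand-in for the two noise estimates
logits = rng.normal(size=(2, 4, 4))   # stand-in for meta-network outputs
fused = collaborative_step(eps, logits)
```

Because the weights form a convex combination at every pixel, the fused estimate always lies between the uni-modal predictions, which is what lets the pre-trained models cooperate without re-training.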
Related papers
- Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models [39.127620891450526]
We introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, to handle both multi-modal data generation and dense visual perception.
We further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set.
arXiv Detail & Related papers (2024-11-07T18:59:53Z)
- Diffusion Models For Multi-Modal Generative Modeling [32.61765315067488]
We propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space.
We propose several multimodal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling.
arXiv Detail & Related papers (2024-07-24T18:04:17Z)
- InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion [53.90516061351706]
We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction.
For sampling, we combine anti-penetration and synthesis-free guidance to enable plausible generation.
Our method significantly outperforms baseline generative models in terms of plausibility and diversity.
arXiv Detail & Related papers (2024-03-26T06:35:55Z)
- FedDiff: Diffusion Model Driven Federated Learning for Multi-Modal and Multi-Clients [32.59184269562571]
We propose a multi-modal collaborative diffusion federated learning framework called FedDiff.
Our framework establishes a dual-branch diffusion model feature extraction setup, where the two modal data are inputted into separate branches of the encoder.
Considering the challenge of private and efficient communication between multiple clients, we embed the diffusion model into the federated learning communication structure.
arXiv Detail & Related papers (2023-11-16T02:29:37Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Multi-modal Latent Diffusion [8.316365279740188]
Multi-modal Variational Autoencoders are a popular family of models that aim to learn a joint representation of the different modalities.
Existing approaches suffer from a coherence-quality tradeoff, where models with good generation quality lack generative coherence across modalities.
We propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders.
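The design in this summary, a set of independently trained, deterministic, uni-modal autoencoders whose latents form a joint space, can be sketched minimally as below. The classes, dimensions, and the linear encoders are illustrative assumptions, not the paper's architecture; a diffusion model would then be trained over the concatenated latents to capture cross-modal coherence.

```python
import numpy as np

class TinyAutoencoder:
    """Illustrative linear autoencoder for one modality (random projection)."""
    def __init__(self, dim_in, dim_latent, seed):
        rng = np.random.default_rng(seed)
        self.enc = rng.normal(size=(dim_in, dim_latent)) / np.sqrt(dim_in)
        # The pseudo-inverse serves as a deterministic decoder for this sketch.
        self.dec = np.linalg.pinv(self.enc)

    def encode(self, x):
        return x @ self.enc

    def decode(self, z):
        return z @ self.dec

# One autoencoder per modality, trained (here: initialized) independently.
image_ae = TinyAutoencoder(dim_in=64, dim_latent=8, seed=0)
text_ae = TinyAutoencoder(dim_in=32, dim_latent=8, seed=1)

x_img = np.random.default_rng(2).normal(size=(5, 64))
x_txt = np.random.default_rng(3).normal(size=(5, 32))

# Joint latent space: concatenation of the uni-modal latents, where a single
# generative model could be trained without touching the autoencoders.
z_joint = np.concatenate([image_ae.encode(x_img), text_ae.encode(x_txt)], axis=1)
```

Keeping the autoencoders uni-modal and frozen is what separates generation quality (handled per modality) from cross-modal coherence (handled in the joint latent space).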
arXiv Detail & Related papers (2023-06-07T14:16:44Z)
- Diffusion Glancing Transformer for Parallel Sequence to Sequence Learning [52.72369034247396]
We propose the diffusion glancing transformer, which employs a modality diffusion process and residual glancing sampling.
DIFFGLAT achieves better generation accuracy while maintaining fast decoding speed compared with both autoregressive and non-autoregressive models.
arXiv Detail & Related papers (2022-12-20T13:36:25Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model [76.89932822375208]
Versatile Diffusion handles multiple flows of text-to-image, image-to-text, and variations in one unified model.
Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
arXiv Detail & Related papers (2022-11-15T17:44:05Z)
- Image Generation with Multimodal Priors using Denoising Diffusion Probabilistic Models [54.1843419649895]
A major challenge in using generative models to accomplish this task is the lack of paired data containing all modalities and corresponding outputs.
We propose a solution based on denoising diffusion probabilistic models to generate images under multi-modal priors.
arXiv Detail & Related papers (2022-06-10T12:23:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.