Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation
- URL: http://arxiv.org/abs/2405.04356v1
- Date: Tue, 7 May 2024 14:33:40 GMT
- Title: Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation
- Authors: Jihyun Kim, Changjae Oh, Hoseok Do, Soohyun Kim, Kwanghoon Sohn
- Abstract summary: We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image.
We present a simple mapping and a style modulation network to link two models and convert meaningful representations in feature maps and attention maps into latent codes.
Our proposed network produces realistic 2D, multi-view, and stylized face images, which align well with inputs.
- Score: 41.341693150031546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of Generative Adversarial Networks (GANs) and diffusion models (DMs) by embedding the multi-modal features of the DM into the latent space of pre-trained GANs. We present a simple mapping and a style modulation network to link the two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion, the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations in the generated image. Our proposed network produces realistic 2D, multi-view, and stylized face images, which align well with inputs. We validate our method by using pre-trained 2D and 3D GANs, and our results outperform existing methods. Our project page is available at https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.
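The core idea in the abstract, mapping diffusion-model feature and attention maps into a pre-trained GAN's latent space via a small mapping network, can be sketched as follows. This is a minimal illustrative stand-in, not the paper's implementation: the pooling scheme, layer sizes, `w_dim`, and `n_styles` are all assumptions, and a random linear projection stands in for the learned mapping/style-modulation network.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_mapping_network(feature_map, attention_map, w_dim=512, n_styles=14):
    """Toy stand-in for a mapping network that converts DM feature maps and
    attention maps into W+ style codes for a pre-trained GAN.
    All dimensions here are illustrative, not the paper's."""
    # Global-average-pool the spatial dimensions into channel descriptors
    f = feature_map.mean(axis=(1, 2))          # shape (C_f,)
    a = attention_map.mean(axis=(1, 2))        # shape (C_a,)
    x = np.concatenate([f, a])                 # fused multi-modal descriptor
    # A single random linear projection stands in for the learned MLP
    W = rng.standard_normal((n_styles * w_dim, x.size)) / np.sqrt(x.size)
    w_plus = (W @ x).reshape(n_styles, w_dim)  # one latent code per GAN layer
    return w_plus

w_plus = simple_mapping_network(
    feature_map=rng.standard_normal((64, 16, 16)),
    attention_map=rng.standard_normal((8, 16, 16)),
)
print(w_plus.shape)  # (14, 512)
```

In the actual method, the resulting W+ codes would be fed to a pre-trained 2D or 3D GAN generator to synthesize the face image; here the shape of the output is the only point being illustrated.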
Related papers
- MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance [32.70801495328193]
We propose a practical framework - MM2Latent - for multimodal image generation and editing.
We use StyleGAN2 as our image generator, FaRL for text encoding, and train autoencoders for spatial modalities such as masks, sketches, and 3DMM parameters.
Our method exhibits superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods.
arXiv Detail & Related papers (2024-09-17T09:21:07Z) - Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model [65.58911408026748]
We propose Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts.
We first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline.
We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation.
arXiv Detail & Related papers (2024-04-28T04:05:10Z) - Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting [9.383423119196408]
We introduce Multi-view ControlNet (MVControl), a novel neural network architecture designed to enhance existing multi-view diffusion models.
MVControl is able to offer 3D diffusion guidance for optimization-based 3D generation.
In pursuit of efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations.
arXiv Detail & Related papers (2024-03-15T02:57:20Z) - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data.
We design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint.
arXiv Detail & Related papers (2024-03-04T07:57:05Z) - Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z) - 3D-aware Image Generation using 2D Diffusion Models [23.150456832947427]
We formulate the 3D-aware image generation task as multiview 2D image set generation, and further decompose it into a sequential unconditional-conditional multiview image generation process.
We utilize 2D diffusion models to boost the generative modeling power of the method.
We train our method on a large-scale dataset, i.e., ImageNet, which is not addressed by previous methods.
arXiv Detail & Related papers (2023-03-31T09:03:18Z) - FENeRF: Face Editing in Neural Radiance Fields [34.332520597067074]
We propose FENeRF, a 3D-aware generator that can produce view-consistent and locally-editable portrait images.
Our method uses two decoupled latent codes to generate corresponding facial semantics and texture in a spatially aligned 3D volume with shared geometry.
Our experiments demonstrate that FENeRF outperforms state-of-the-art methods in various face editing tasks.
arXiv Detail & Related papers (2021-11-30T15:23:08Z) - Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
A visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation in manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
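The visual-linguistic similarity idea described in the TediGAN entry, mapping image and text into a common embedding space and scoring their match, can be sketched as below. This is a hedged toy illustration: the random linear projections `img_proj` and `txt_proj` stand in for TediGAN's learned encoders, and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def embed(x, proj):
    """Project a modality-specific feature into the shared space and
    L2-normalize it, as in common-embedding text-image matching."""
    z = proj @ x
    return z / np.linalg.norm(z)

def text_image_similarity(img_feat, txt_feat, img_proj, txt_proj):
    # Cosine similarity between the two normalized embeddings
    return float(embed(img_feat, img_proj) @ embed(txt_feat, txt_proj))

# Illustrative dimensions only; the real encoders are learned networks.
d_img, d_txt, d_common = 128, 64, 32
img_proj = rng.standard_normal((d_common, d_img))
txt_proj = rng.standard_normal((d_common, d_txt))
score = text_image_similarity(rng.standard_normal(d_img),
                              rng.standard_normal(d_txt),
                              img_proj, txt_proj)
print(-1.0 <= score <= 1.0)  # True: cosine similarity is bounded
```

A higher score would indicate a better-matched text-image pair; in the full framework this signal guides latent-code search, while instance-level optimization preserves identity.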
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.