Towards Open-World Text-Guided Face Image Generation and Manipulation
- URL: http://arxiv.org/abs/2104.08910v1
- Date: Sun, 18 Apr 2021 16:56:07 GMT
- Title: Towards Open-World Text-Guided Face Image Generation and Manipulation
- Authors: Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu
- Abstract summary: We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
- Score: 52.83401421019309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The existing text-guided image synthesis methods can only produce
results of limited quality, with at most $256^2$ resolution, and their textual
instructions are constrained to a small corpus. In this work, we propose a
unified framework for both face image generation and manipulation that produces
diverse and high-quality images at an unprecedented resolution of $1024^2$ from
multimodal inputs. More importantly, our method supports open-world scenarios,
including both image and text, without any re-training, fine-tuning, or
post-processing. To be specific, we propose a brand new paradigm of text-guided
image generation and manipulation based on the superior characteristics of a
pretrained GAN model. Our proposed paradigm includes two novel strategies. The
first strategy is to train a text encoder to obtain latent codes that align
with the hierarchical semantics of the aforementioned pretrained GAN model.
The second strategy is to directly optimize the latent codes in the latent
space of the pretrained GAN model with guidance from a pretrained language
model. The latent codes can be randomly sampled from a prior distribution or
inverted from a given image, which provides inherent supports for both image
generation and manipulation from multi-modal inputs, such as sketches or
semantic labels, with textual guidance. To facilitate text-guided multi-modal
synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset
consisting of real face images and their corresponding semantic segmentation maps,
sketches, and textual descriptions. Extensive experiments on the introduced
dataset demonstrate the superior performance of our proposed method. Code and
data are available at https://github.com/weihaox/TediGAN.
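The second strategy, directly optimizing latent codes under the guidance of a pretrained language model, can be sketched compactly. The snippet below is a minimal, hypothetical illustration rather than the released TediGAN code: `generator` stands for a frozen StyleGAN-like synthesizer, `clip_model` for a CLIP-like joint vision-language encoder with an `encode_image` method, `text_features` for the encoded textual instruction, and the regularization weight is an arbitrary placeholder.

```python
# Minimal sketch: text-guided optimization of a latent code in the latent
# space of a pretrained GAN. All model objects are assumed placeholders.
import torch
import torch.nn.functional as F

def optimize_latent(generator, clip_model, text_features, w_init, steps=200, lr=0.01):
    """Optimize a latent code so the synthesized image matches the text guidance."""
    w = w_init.clone().requires_grad_(True)   # sampled from a prior or inverted from an image
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        image = generator(w)                               # frozen pretrained generator
        image_features = clip_model.encode_image(image)    # joint vision-language embedding
        # Cosine distance between image and text embeddings drives the update.
        loss = 1.0 - F.cosine_similarity(image_features, text_features).mean()
        # Keep the code close to its starting point so structure/identity is preserved.
        loss = loss + 0.01 * F.mse_loss(w, w_init)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()
```

Because the latent code can be sampled from a prior or obtained by inverting a given image, the same loop serves both generation and manipulation, mirroring the claim in the abstract.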
Related papers
- TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training [5.239585892767183]
We propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training.
Our proposed method outperforms existing state-of-the-art methods, both on text-guided generation tasks and manipulation tasks.
arXiv Detail & Related papers (2023-09-21T09:34:20Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts of any precision levels.
The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z)
- LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models [12.06277444740134]
Generic image manipulation using a single model with flexible text inputs is highly desirable.
Recent work addresses this task by guiding generative models trained on generic image datasets using pretrained vision-language encoders.
We propose an optimization-free method for the task of generic image manipulation from text prompts.
arXiv Detail & Related papers (2022-10-05T13:26:15Z)
- More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained end-to-end so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
Its StyleGAN inversion module maps real images into the latent space of a well-trained StyleGAN.
Its visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space (a minimal sketch of such a module follows this list).
Its instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
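As referenced in the TediGAN entry above, a visual-linguistic similarity module maps images and text into a common embedding space and scores their match. The following is a minimal, hypothetical sketch of such a module, not the released TediGAN implementation: the projection dimensions and the batch-wise matching loss are illustrative assumptions.

```python
# Minimal sketch of a visual-linguistic similarity module: project image and
# text features into a common embedding space and train with a pairwise
# matching loss. Dimensions and loss formulation are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualLinguisticSimilarity(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)  # project image features
        self.text_proj = nn.Linear(text_dim, embed_dim)    # project text features

    def forward(self, image_feats, text_feats):
        # L2-normalize so the dot product is a cosine score in the common space.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img @ txt.t()   # pairwise similarity matrix (batch x batch)

def matching_loss(similarity):
    # Matched image-text pairs lie on the diagonal; treat matching as
    # classification over the batch (a common contrastive formulation).
    targets = torch.arange(similarity.size(0), device=similarity.device)
    return 0.5 * (F.cross_entropy(similarity, targets) +
                  F.cross_entropy(similarity.t(), targets))
```

Under this formulation, each image is encouraged to score highest against its own description and vice versa; a learnable temperature, omitted here for brevity, is a common refinement.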