clip2latent: Text driven sampling of a pre-trained StyleGAN using
denoising diffusion and CLIP
- URL: http://arxiv.org/abs/2210.02347v1
- Date: Wed, 5 Oct 2022 15:49:41 GMT
- Title: clip2latent: Text driven sampling of a pre-trained StyleGAN using
denoising diffusion and CLIP
- Authors: Justin N. M. Pinkney and Chuan Li
- Abstract summary: We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN.
It enables text-driven sampling with an existing generative model without any external data or fine-tuning.
We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text-labelled data for training the conditional diffusion model.
- Score: 1.3733526575192976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN.
It enables text-driven sampling with an existing generative model without any external data or fine-tuning.
This is achieved by training a diffusion model conditioned on CLIP embeddings to sample latent vectors of a pre-trained StyleGAN, which we call clip2latent.
We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text-labelled data for training the conditional diffusion model.
We demonstrate that clip2latent allows us to generate high-resolution (1024x1024 pixels) images based on text prompts with fast sampling, high image quality, and low training compute and data requirements.
We also show that the use of the well-studied StyleGAN architecture, without further fine-tuning, allows us to directly apply existing methods to control and modify the generated images, adding a further layer of control to our text-to-image pipeline.
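To make the pipeline concrete, here is a minimal, hypothetical sketch of the training-pair construction and the text-conditioned sampling the abstract describes. All names (`G.mapping`, `G.synthesis`, `prior.sample`, the preprocessing helper) are illustrative assumptions, not the authors' released API.

```python
import torch

@torch.no_grad()
def make_training_pair(G, clip_model, clip_preprocess, device="cuda"):
    """Build one (CLIP image embedding, StyleGAN w latent) pair with no labelled
    data: sample the GAN, then embed its own output with CLIP's image encoder."""
    z = torch.randn(1, G.z_dim, device=device)   # random StyleGAN input
    w = G.mapping(z, None)                       # latent the diffusion prior learns to sample
    image = G.synthesis(w)                       # 1024x1024 image
    e_image = clip_model.encode_image(clip_preprocess(image))
    return e_image, w                            # (conditioning, target) for the prior

@torch.no_grad()
def text_to_image(prompt, G, clip_model, tokenizer, prior, device="cuda"):
    """At inference, rely on CLIP's image/text alignment: condition the diffusion
    prior on the *text* embedding and decode the sampled latent with StyleGAN."""
    e_text = clip_model.encode_text(tokenizer([prompt]).to(device))
    w = prior.sample(cond=e_text)                # denoising diffusion sampling of w
    return G.synthesis(w)                        # high-resolution output image
```

Because the prior sees only (image embedding, latent) pairs produced by the GAN itself, no text-labelled data is needed; text conditioning works at inference time through the shared CLIP embedding space.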
Related papers
- Direct Consistency Optimization for Compositional Text-to-Image
Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model (a sketch of this trade-off follows below).
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
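The trade-off described in the entry above, fitting the reference images while penalizing drift from the pre-trained model, can be written as a regularised fine-tuning loss. The snippet below is a generic, simplified squared-error illustration under assumed names, not the paper's exact Direct Consistency Optimization objective.

```python
import torch
import torch.nn.functional as F

def personalisation_loss(model, frozen_model, noisy_latents, timesteps, cond,
                         target_noise, deviation_weight=1.0):
    """Denoising loss on the reference images plus a penalty that keeps the
    fine-tuned predictions close to those of the frozen pre-trained model."""
    pred = model(noisy_latents, timesteps, cond)                  # fine-tuned prediction
    with torch.no_grad():
        pred_ref = frozen_model(noisy_latents, timesteps, cond)  # pre-trained prediction
    consistency = F.mse_loss(pred, target_noise)                  # fit the references
    deviation = F.mse_loss(pred, pred_ref)                        # penalise drift from the prior
    return consistency + deviation_weight * deviation
```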
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- The CLIP Model is Secretly an Image-to-Prompt Converter [26.92989288717742]
The paper demonstrates that the CLIP model, as utilized in Stable Diffusion, inherently possesses the ability to instantaneously convert images into text prompts.
Such an image-to-prompt conversion can be achieved with a linear projection matrix calculated in closed form (illustrated in the sketch below).
arXiv Detail & Related papers (2023-05-22T04:52:12Z)
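As a generic illustration of "a linear projection matrix calculated in closed form", the snippet below fits a least-squares map between embedding spaces. The paper computes its specific matrix in closed form from the pre-trained models themselves, so the paired-embedding inputs and names here are assumptions, not its method.

```python
import numpy as np

def fit_linear_projection(image_embeds: np.ndarray, prompt_embeds: np.ndarray) -> np.ndarray:
    """Solve min_W ||image_embeds @ W - prompt_embeds||^2 in closed form."""
    W, *_ = np.linalg.lstsq(image_embeds, prompt_embeds, rcond=None)
    return W

def image_to_prompt(image_embed: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map a CLIP image embedding into the prompt-embedding space."""
    return image_embed @ W
```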
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, as well as qualitatively when editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- SceneGenie: Scene Graph Guided Diffusion Models for Image Synthesis [38.22195812238951]
We propose a novel guidance approach for the sampling process in the diffusion model.
Our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints.
Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance into the diffusion model sampling process (see the guidance sketch below).
arXiv Detail & Related papers (2023-04-28T00:14:28Z)
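The guidance idea mentioned above, steering diffusion sampling with semantic (CLIP) and geometric (bounding box, segmentation) constraints, can be illustrated with a generic gradient-guided denoising step. Function names and the exact way SceneGenie injects its constraints are assumptions, not the paper's implementation.

```python
import torch

def guided_denoise_step(x_t, t, denoiser, step_fn, constraint_loss, guidance_scale=1.0):
    """One reverse-diffusion step nudged by the gradient of an auxiliary
    constraint loss (e.g. a CLIP-similarity or layout term) evaluated on x_t."""
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)                        # unguided noise prediction
    loss = constraint_loss(x_t, t)                # how badly the constraints are violated
    grad = torch.autograd.grad(loss, x_t)[0]      # direction that reduces the violation
    eps_guided = eps + guidance_scale * grad      # push the prediction toward compliance
    return step_fn(x_t.detach(), eps_guided.detach(), t)   # standard DDPM/DDIM update
```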
- Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis (sketched below).
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
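The "ensemble of expert denoisers" idea, with different networks specialised for different stages of the reverse process, can be sketched as simple timestep-based routing. Interval boundaries, names, and the update rule are illustrative assumptions rather than eDiffi's implementation; note that only one expert runs per step, which is why inference cost stays the same.

```python
from typing import Callable, List, Tuple

# A denoiser maps (x_t, t, text_cond) -> predicted noise; types kept loose here.
Denoiser = Callable[..., object]

def select_expert(t: int, experts: List[Tuple[range, Denoiser]]) -> Denoiser:
    """Pick the expert denoiser whose timestep interval contains t."""
    for interval, denoiser in experts:
        if t in interval:
            return denoiser
    raise ValueError(f"no expert covers timestep {t}")

def sample(x_T, timesteps, experts, text_cond, step_fn):
    """Reverse diffusion that routes each step to its stage-specialised expert."""
    x = x_T
    for t in timesteps:                           # e.g. range(999, -1, -1)
        eps = select_expert(t, experts)(x, t, text_cond)
        x = step_fn(x, eps, t)                    # standard DDPM/DDIM update
    return x
```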
- Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.