Hierarchical Text-Conditional Image Generation with CLIP Latents
- URL: http://arxiv.org/abs/2204.06125v1
- Date: Wed, 13 Apr 2022 01:10:33 GMT
- Title: Hierarchical Text-Conditional Image Generation with CLIP Latents
- Authors: Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
- Abstract summary: We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
- Score: 20.476720970770128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive models like CLIP have been shown to learn robust representations
of images that capture both semantics and style. To leverage these
representations for image generation, we propose a two-stage model: a prior
that generates a CLIP image embedding given a text caption, and a decoder that
generates an image conditioned on the image embedding. We show that explicitly
generating image representations improves image diversity with minimal loss in
photorealism and caption similarity. Our decoders conditioned on image
representations can also produce variations of an image that preserve both its
semantics and style, while varying the non-essential details absent from the
image representation. Moreover, the joint embedding space of CLIP enables
language-guided image manipulations in a zero-shot fashion. We use diffusion
models for the decoder and experiment with both autoregressive and diffusion
models for the prior, finding that the latter are computationally more
efficient and produce higher-quality samples.
Related papers
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z) - Unlocking Pre-trained Image Backbones for Semantic Image Synthesis [29.688029979801577]
We propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images.
Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes.
arXiv Detail & Related papers (2023-12-20T09:39:19Z) - StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual
Representation Learners [58.941838860425754]
We show that training self-supervised methods on synthetic images can match or beat the real image counterpart.
We develop a multi-positive contrastive learning method, which we call StableRep.
With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP.
arXiv Detail & Related papers (2023-06-01T17:59:51Z) - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on deepfake detection generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z) - Variational Distribution Learning for Unsupervised Text-to-Image
Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
arXiv Detail & Related papers (2023-03-28T16:18:56Z) - Cap2Aug: Caption guided Image to Image data Augmentation [41.53127698828463]
Cap2Aug is an image-to-image diffusion model-based data augmentation strategy using image captions as text prompts.
We generate captions from the limited training images and using these captions edit the training images using an image-to-image stable diffusion model.
This strategy generates augmented versions of images similar to the training images yet provides semantic diversity across the samples.
arXiv Detail & Related papers (2022-12-11T04:37:43Z) - clip2latent: Text driven sampling of a pre-trained StyleGAN using
denoising diffusion and CLIP [1.3733526575192976]
We introduce a new method to efficiently create text-to-image models from a pre-trained CLIP and StyleGAN.
It enables text driven sampling with an existing generative model without any external data or fine-tuning.
We leverage the alignment between CLIP's image and text embeddings to avoid the need for any text labelled data for training the conditional diffusion model.
arXiv Detail & Related papers (2022-10-05T15:49:41Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the emphde facto Generative Adversarial Nets (GANs)
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.