FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN
Space Optimization
- URL: http://arxiv.org/abs/2112.01573v1
- Date: Thu, 2 Dec 2021 19:27:27 GMT
- Title: FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN
Space Optimization
- Authors: Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, Qiang
Liu
- Abstract summary: We approach text-to-image generation by combining the power of the pre-trained CLIP representation with an off-the-shelf image generator (GAN).
When prompted with different input text, FuseDream can generate high-quality images with varying objects, backgrounds, and artistic styles, and even novel counterfactual concepts that do not appear in the training data we use.
- Score: 37.318948462348054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating images from natural language instructions is an intriguing yet
highly challenging task. We approach text-to-image generation by combining the
power of the pre-trained CLIP representation with an off-the-shelf image
generator (GAN), optimizing in the latent space of the GAN to find images that
achieve maximum CLIP score with the given input text. Compared to traditional
methods that train text-to-image generative models from scratch, the CLIP+GAN
approach is training-free, zero-shot, and can be easily customized with
different generators.
However, optimizing the CLIP score in the GAN space poses a highly challenging
optimization problem, and off-the-shelf optimizers such as Adam fail to yield
satisfying results. In this work, we propose the FuseDream pipeline, which
improves the CLIP+GAN approach with three key techniques: 1) an AugCLIP score
that robustifies the CLIP objective by applying random augmentations to the
image; 2) a novel initialization and over-parameterization strategy that lets
us efficiently navigate the non-convex landscape of the GAN space; and 3) a
composed generation technique that, by leveraging a novel bi-level optimization
formulation, composes multiple images to extend the GAN space and overcome
data bias.
When prompted with different input text, FuseDream can generate high-quality
images with varying objects, backgrounds, and artistic styles, and even novel
counterfactual concepts that do not appear in the training data of the GAN we
use. Quantitatively, the images generated by FuseDream achieve top-level
Inception Score and FID on the MS COCO dataset, without additional
architecture design or training. Our code is publicly available at
\url{https://github.com/gnobitab/FuseDream}.
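The latent-space search described in the abstract can be illustrated with a short, hedged sketch: an augmentation-averaged CLIP score (in the spirit of AugCLIP) is maximized over an over-parameterized basis of GAN latents. The `generator` callable, the augmentation set, and all hyperparameters below are illustrative assumptions rather than the configuration in the FuseDream repository; only the OpenAI CLIP calls (`clip.load`, `clip.tokenize`, `encode_image`, `encode_text`) refer to a real API.

```python
# Hedged sketch: maximize an augmentation-averaged CLIP score over GAN latents.
# `generator` is a stand-in for an off-the-shelf GAN (BigGAN in the paper) that
# maps a latent batch to NCHW images in [0, 1]; its signature is an assumption.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)


def aug_clip_score(images, text_tokens, n_aug=8):
    """Average the CLIP image-text similarity over random crops of the images.
    (CLIP's input normalization is omitted here for brevity.)"""
    text_feat = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    size = images.shape[-1]
    scores = []
    for _ in range(n_aug):
        crop = torch.randint(int(0.7 * size), size + 1, (1,)).item()
        top = torch.randint(0, size - crop + 1, (1,)).item()
        left = torch.randint(0, size - crop + 1, (1,)).item()
        patch = images[..., top:top + crop, left:left + crop]
        patch = F.interpolate(patch, size=224, mode="bilinear", align_corners=False)
        img_feat = F.normalize(clip_model.encode_image(patch), dim=-1)
        scores.append((img_feat * text_feat).sum(dim=-1))
    return torch.stack(scores).mean()


def optimize_latents(generator, text, latent_dim=128, n_basis=5, steps=300, lr=0.05):
    """Over-parameterized search: optimize a small basis of latents plus mixing
    weights instead of a single latent vector (a rough analogue of the paper's
    initialization/over-parameterization idea, not its exact recipe)."""
    text_tokens = clip.tokenize([text]).to(device)
    basis = torch.randn(n_basis, latent_dim, device=device, requires_grad=True)
    weights = torch.zeros(n_basis, 1, device=device, requires_grad=True)
    opt = torch.optim.Adam([basis, weights], lr=lr)
    for _ in range(steps):
        z = (torch.softmax(weights, dim=0) * basis).sum(dim=0, keepdim=True)
        images = generator(z)
        loss = -aug_clip_score(images, text_tokens)  # ascend the AugCLIP-style score
        opt.zero_grad()
        loss.backward()
        opt.step()
    z = (torch.softmax(weights, dim=0) * basis).sum(dim=0, keepdim=True)
    return generator(z)
```

With a suitable generator plugged in, a call such as `optimize_latents(generator, "a blue dog in the forest")` would return the candidate image the search converged to. The paper's composed-generation step, which mixes multiple images through a bi-level formulation, is not reproduced in this sketch.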
Related papers
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a large amount of high-quality and diverse text to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis [74.71986888051381]
We propose Generative Adversarial CLIPs to enable high-quality, efficient, fast, and controllable text-to-image synthesis.
Our model achieves 120 times faster synthesis speed and inherits the smooth latent space of GANs.
arXiv Detail & Related papers (2023-01-30T14:58:23Z)
- Bridging CLIP and StyleGAN through Latent Alignment for Image Editing [33.86698044813281]
We bridge CLIP and StyleGAN to mine diverse manipulation directions without inference-time optimization.
With this mapping scheme, we can achieve GAN inversion, text-to-image generation and text-driven image manipulation.
arXiv Detail & Related papers (2022-10-10T09:17:35Z)
- One-Shot Adaptation of GAN in Just One CLIP [51.188396199083336]
We present a novel single-shot GAN adaptation method through unified CLIP space manipulations.
Specifically, our model employs a two-step training strategy that starts with a reference image search in the source generator using CLIP-guided latent optimization.
We show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-03-17T13:03:06Z)
- OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs [8.26410341981427]
We study how to ensure that generated samples are believable, realistic or natural.
We present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture.
arXiv Detail & Related papers (2022-02-25T20:00:33Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)