CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
- URL: http://arxiv.org/abs/2203.00386v1
- Date: Tue, 1 Mar 2022 12:11:32 GMT
- Title: CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
- Authors: Zihao Wang, Wei Liu, Qian He, Xinglong Wu, Zili Yi
- Abstract summary: We propose a self-supervised scheme named CLIP-GEN for general text-to-image generation.
In our approach, we only require a set of unlabeled images in the general domain to train a text-to-image generator.
Our method significantly outperforms optimization-based text-to-image methods in terms of image quality.
- Score: 17.861540412002967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a text-to-image generator in the general domain (e.g., DALL-E,
CogView) requires huge amounts of paired text-image data, which is too
expensive to collect. In this paper, we propose a self-supervised scheme named
CLIP-GEN for general text-to-image generation with the language-image priors
extracted by a pre-trained CLIP model. In our approach, we only require a set
of unlabeled images in the general domain to train a text-to-image generator.
Specifically, given an image without text labels, we first extract the
embedding of the image in the unified language-vision embedding space with the
image encoder of CLIP. Next, we convert the image into a sequence of discrete
tokens in the VQGAN codebook space (the VQGAN model can be trained on the
unlabeled image dataset at hand). Finally, we train an autoregressive
transformer that maps the unified language-vision representation of an image to
its sequence of image tokens. Once trained, the transformer can generate
coherent image tokens conditioned on the text embedding extracted from an input
text by the text encoder of CLIP. Such a strategy enables us to train a strong
and general text-to-image generator on a large text-free image dataset such as
ImageNet. Qualitative and quantitative evaluations verify that our method
significantly outperforms optimization-based text-to-image methods in terms of
image quality while not compromising text-image matching. Our method even
achieves performance comparable to flagship supervised models such as CogView.
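The three steps above (CLIP image embedding, VQGAN tokenization, conditional autoregressive modeling, with the CLIP text embedding swapped in at inference) can be summarized in a short sketch. This is not the authors' released code: the `vqgan` and `transformer` objects and their `encode_to_indices`, `decode_from_indices`, `condition=`, and `sample` interfaces are hypothetical placeholders, and only the `clip` calls follow OpenAI's public CLIP package.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)  # frozen, off-the-shelf CLIP
clip_model.eval()

def train_step(images, vqgan, transformer, optimizer):
    # Language-free training: the condition is the CLIP *image* embedding and the
    # target is the same image's VQGAN token sequence (no captions involved).
    # `images` is assumed to be preprocessed appropriately for both encoders.
    with torch.no_grad():
        cond = clip_model.encode_image(images).float()    # (B, D) joint-space embedding
        tokens = vqgan.encode_to_indices(images)          # (B, L) codebook ids -- assumed API
    logits = transformer(tokens[:, :-1], condition=cond)  # next-token prediction -- assumed API
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def generate(text, vqgan, transformer, seq_len):
    # Inference: substitute the CLIP *text* embedding as the condition, sample
    # image tokens autoregressively, and decode them back to pixels with VQGAN.
    cond = clip_model.encode_text(clip.tokenize([text]).to(device)).float()
    tokens = transformer.sample(condition=cond, length=seq_len)  # assumed sampling helper
    return vqgan.decode_from_indices(tokens)                     # token ids -> image -- assumed API
```

The design point the sketch illustrates is that training and inference share the same conditioning slot: because CLIP embeds images and text in a joint space, a transformer conditioned only on image embeddings during training can be driven by text embeddings at test time.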
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis [37.32270579534541]
We propose a novel approach for enhancing text-image correspondence by leveraging available semantic layouts.
Our approach achieves higher text-image correspondence than existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets.
arXiv Detail & Related papers (2023-08-16T05:59:33Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)