CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
- URL: http://arxiv.org/abs/2203.00386v1
- Date: Tue, 1 Mar 2022 12:11:32 GMT
- Title: CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
- Authors: Zihao Wang, Wei Liu, Qian He, Xinglong Wu, Zili Yi
- Abstract summary: We propose a self-supervised scheme named as CLIP-GEN for general text-to-image generation.
In our approach, we only require a set of unlabeled images in the general domain to train a text-to-image generator.
Our method significantly outperforms optimization-based text-to-image methods in terms of image quality.
- Score: 17.861540412002967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a text-to-image generator in the general domain (e.g., Dall.e,
CogView) requires huge amounts of paired text-image data, which is too
expensive to collect. In this paper, we propose a self-supervised scheme named
as CLIP-GEN for general text-to-image generation with the language-image priors
extracted with a pre-trained CLIP model. In our approach, we only require a set
of unlabeled images in the general domain to train a text-to-image generator.
Specifically, given an image without text labels, we first extract the
embedding of the image in the united language-vision embedding space with the
image encoder of CLIP. Next, we convert the image into a sequence of discrete
tokens in the VQGAN codebook space (the VQGAN model can be trained with the
unlabeled image dataset in hand). Finally, we train an autoregressive
transformer that maps the image tokens from its unified language-vision
representation. Once trained, the transformer can generate coherent image
tokens based on the text embedding extracted from the text encoder of CLIP upon
an input text. Such a strategy enables us to train a strong and general
text-to-image generator with large text-free image dataset such as ImageNet.
Qualitative and quantitative evaluations verify that our method significantly
outperforms optimization-based text-to-image methods in terms of image quality
while not compromising the text-image matching. Our method can even achieve
comparable performance as flagship supervised models like CogView.
Related papers
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z) - Learning to Generate Semantic Layouts for Higher Text-Image
Correspondence in Text-to-Image Synthesis [37.32270579534541]
We propose a novel approach for enhancing text-image correspondence by leveraging available semantic layouts.
Our approach achieves higher text-image correspondence compared to existing text-to-image generation approaches in the Multi-Modal CelebA-HQ and the Cityscapes dataset.
arXiv Detail & Related papers (2023-08-16T05:59:33Z) - Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z) - Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z) - GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures
in Text-to-Image Generation [18.396131717250793]
We introduce GlyphDraw, a general learning framework aiming to endow image generation models with the capacity to generate images coherently embedded with text for any specific language.
Our method not only produces accurate language characters as in prompts, but also seamlessly blends the generated text into the background.
arXiv Detail & Related papers (2023-03-31T08:06:33Z) - CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image
Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z) - XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.