LAFITE: Towards Language-Free Training for Text-to-Image Generation
- URL: http://arxiv.org/abs/2111.13792v1
- Date: Sat, 27 Nov 2021 01:54:45 GMT
- Title: LAFITE: Towards Language-Free Training for Text-to-Image Generation
- Authors: Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer,
Tong Yu, Jiuxiang Gu, Jinhui Xu, Tong Sun
- Abstract summary: We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
- Score: 83.2935513540494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the major challenges in training text-to-image generation models is
the need for a large number of high-quality image-text pairs. While image
samples are often easily accessible, the associated text descriptions typically
require careful human captioning, which is particularly time-consuming and
costly. In this paper, we propose the first work to train text-to-image
generation models without any text data. Our method leverages the well-aligned
multi-modal semantic space of the powerful pre-trained CLIP model: the
requirement of text-conditioning is seamlessly alleviated via generating text
features from image features. Extensive experiments are conducted to illustrate
the effectiveness of the proposed method. We obtain state-of-the-art results in
the standard text-to-image generation tasks. Importantly, the proposed
language-free model outperforms most existing models trained with full
image-text pairs. Furthermore, our method can be applied to fine-tune
pre-trained models, which saves both time and cost when training
text-to-image generation models. Our pre-trained model obtains competitive
results in zero-shot text-to-image generation on the MS-COCO dataset, yet with
only around 1% of the model size and training data size relative to the
recently proposed large DALL-E model.
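The core mechanism described in the abstract is that CLIP's shared image-text embedding space lets image features stand in for the text features a caption would provide, so the generator can be conditioned without any captions at training time. Below is a minimal sketch of that idea in Python, assuming the OpenAI CLIP package; the function name `pseudo_text_features`, the `noise_level` parameter, and the Gaussian-perturbation scheme are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: language-free conditioning via CLIP's aligned image-text space.
# Pseudo "text" features are synthesized from image features, so a generator
# can be trained on images alone. Hypothetical illustration, not LAFITE's code.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def pseudo_text_features(images: torch.Tensor, noise_level: float = 0.1) -> torch.Tensor:
    """Map a batch of preprocessed images to pseudo text-like features.

    Because CLIP aligns image and text embeddings in one space, a
    noise-perturbed image embedding can substitute for the embedding
    that a (missing) caption would have produced.
    """
    img_feat = model.encode_image(images).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    # Perturb with isotropic Gaussian noise so the conditioning signal is not
    # simply the exact image embedding, then project back to the unit sphere.
    noise = torch.randn_like(img_feat)
    noise = noise / noise.norm(dim=-1, keepdim=True)
    feat = img_feat + noise_level * noise
    return feat / feat.norm(dim=-1, keepdim=True)

# At inference time, real captions are encoded with CLIP's text tower and fed
# through the same conditioning pathway, e.g.:
# tokens = clip.tokenize(["a photo of a corgi on the beach"]).to(device)
# txt_feat = model.encode_text(tokens).float()
# txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
```

Because both towers map into the same embedding space, swapping in the text encoder at inference requires no change to the generator, which is what makes the training language-free while still supporting text prompts.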
Related papers
- CLIP-VQDiffusion : Langauge Free Training of Text To Image generation using CLIP and vector quantized diffusion model [2.9849290402462927]
We propose CLIP-VQDiffusion, which leverages the pretrained CLIP model to provide multimodal text-image representations and strong image generation capabilities.
Our model outperforms previous state-of-the-art methods by 4.4% in CLIPScore and generates very realistic images for both in- and out-of-distribution text.
arXiv Detail & Related papers (2024-03-22T04:34:59Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
- Shifted Diffusion for Text-to-image Generation [65.53758187995744]
Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text.
Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.
arXiv Detail & Related papers (2022-11-24T03:25:04Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data [9.3935916515127]
We introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding.
ImageBERT is a Transformer-based model that takes different modalities as input and models the relationships between them.
arXiv Detail & Related papers (2020-01-22T11:35:58Z)