Lafite2: Few-shot Text-to-Image Generation
- URL: http://arxiv.org/abs/2210.14124v1
- Date: Tue, 25 Oct 2022 16:22:23 GMT
- Title: Lafite2: Few-shot Text-to-Image Generation
- Authors: Yufan Zhou, Chunyuan Li, Changyou Chen, Jianfeng Gao, Jinhui Xu
- Abstract summary: We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It uses a retrieval-then-optimization procedure to synthesize pseudo text features.
It benefits a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
- Score: 132.14211027057766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image generation models have progressed considerably in recent years
and can now generate impressively realistic images from arbitrary text. Most
such models are trained on web-scale image-text paired datasets, which may not
be affordable for many researchers. In this paper, we propose a novel method
for pre-training a text-to-image generation model on image-only datasets. It
uses a retrieval-then-optimization procedure to synthesize pseudo text
features: for a given image, relevant pseudo text features are first retrieved,
then optimized for better alignment. The low data requirement of the proposed
method yields high flexibility and usability: it benefits a wide range of
settings, including few-shot, semi-supervised, and fully-supervised learning,
and it can be applied to different models, including generative adversarial
networks (GANs) and diffusion models. Extensive experiments illustrate the
effectiveness of the proposed method. On the MS-COCO dataset, our GAN model
obtains a Fréchet Inception Distance (FID) of 6.78, a new state-of-the-art
(SoTA) among GANs in the fully-supervised setting. Our diffusion model obtains
FIDs of 8.42 and 4.28 in the zero-shot and supervised settings respectively,
which are competitive with SoTA diffusion models at a much smaller model size.
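As a concrete picture of the retrieval-then-optimization procedure, the sketch below shows one way it could look in PyTorch. It assumes a CLIP-style shared image-text embedding space and a bank of precomputed candidate text features; `clip_image_encoder` and `candidate_text_features` are hypothetical placeholders, not the paper's actual code.
```python
# Minimal sketch of retrieval-then-optimization for pseudo text features.
# Assumes a CLIP-style shared image-text embedding space; all names here
# are hypothetical placeholders, not the paper's implementation.
import torch
import torch.nn.functional as F

def retrieve_then_optimize(image, clip_image_encoder, candidate_text_features,
                           k=8, steps=100, lr=0.01):
    # Retrieval: embed the image and keep the k candidate text features
    # with the highest cosine similarity to it.
    with torch.no_grad():
        img = F.normalize(clip_image_encoder(image), dim=-1)          # (1, d)
        cand = F.normalize(candidate_text_features, dim=-1)           # (n, d)
    sims = (cand @ img.T).squeeze(1)                                  # (n,)
    pseudo = cand[sims.topk(k).indices].clone().requires_grad_(True)  # (k, d)

    # Optimization: treat the retrieved features as free parameters and
    # push them toward the image embedding for tighter alignment.
    opt = torch.optim.Adam([pseudo], lr=lr)
    for _ in range(steps):
        loss = 1.0 - F.cosine_similarity(pseudo, img, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # The aligned pseudo text features can then condition a GAN or a
    # diffusion model in place of real caption embeddings.
    return F.normalize(pseudo.detach(), dim=-1)
```
In the few-shot and semi-supervised settings the abstract mentions, any available real captions could presumably contribute their CLIP text features to the candidate bank alongside the synthesized ones.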
Related papers
- MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative power.
In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning diffusion models remains an underexplored frontier in generative artificial intelligence (GenAI).
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion).
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
- LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation [24.694298869398033]
Our method trains efficiently and generates images with both high perceptual quality and layout alignment.
It significantly outperforms 10 other generative models based on GANs, VQ-VAE, and diffusion models.
arXiv Detail & Related papers (2023-02-16T14:20:25Z)
- StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis [54.39789900854696]
StyleGAN-T addresses the specific requirements of large-scale text-to-image synthesis.
It significantly improves over previous GANs and outperforms distilled diffusion models in terms of sample quality and speed.
arXiv Detail & Related papers (2023-01-23T16:05:45Z)
- Shifted Diffusion for Text-to-image Generation [65.53758187995744]
Corgi is based on our proposed shifted diffusion model, which generates better image embeddings from input text.
Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.
arXiv Detail & Related papers (2022-11-24T03:25:04Z)
- Implementing and Experimenting with Diffusion Models for Text-to-Image Generation [0.0]
Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description.
Text-to-image models require exceptionally large amounts of computational resources to train, as well as huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
arXiv Detail & Related papers (2022-09-22T12:03:33Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model (a minimal sketch of this idea follows below).
We obtain state-of-the-art results on standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
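As a rough illustration of the language-free idea behind LAFITE, a pseudo text feature can be built by perturbing a CLIP image embedding, exploiting the fact that CLIP places matching images and captions close together in its joint space. The sketch below is one plausible variant under that assumption; `clip_image_encoder` is a placeholder, not LAFITE's released implementation.
```python
import torch
import torch.nn.functional as F

def pseudo_text_from_image(image, clip_image_encoder, noise_level=0.1):
    # Embed the image into CLIP's joint image-text space.
    with torch.no_grad():
        h = F.normalize(clip_image_encoder(image), dim=-1)
    # Because CLIP keeps a caption's embedding near its image's embedding,
    # a noised image feature can stand in for the missing text feature.
    noise = F.normalize(torch.randn_like(h), dim=-1)
    return F.normalize(h + noise_level * noise, dim=-1)
```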
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.