Implementing and Experimenting with Diffusion Models for Text-to-Image
Generation
- URL: http://arxiv.org/abs/2209.10948v1
- Date: Thu, 22 Sep 2022 12:03:33 GMT
- Title: Implementing and Experimenting with Diffusion Models for Text-to-Image
Generation
- Authors: Robin Zbinden
- Abstract summary: Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images could be generated from a simple textual description of an image.
Text-to-image models require exceptionally large amounts of computational resources to train, as well as huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Taking advantage of the many recent advances in deep learning,
text-to-image generative models are currently attracting considerable attention
from the general public. Two of these models, DALL-E 2 and Imagen, have
demonstrated that highly photorealistic images can be generated from a simple
textual description of an image. Based on a novel approach to image generation
called diffusion models, text-to-image models enable the production of many
different types of high-resolution images, where human imagination is the only
limit. However, these models require exceptionally large amounts of
computational resources to train, as well as huge datasets collected from the
internet. In addition, neither the codebase nor the models have been released,
which prevents the AI community from experimenting with these cutting-edge
models and makes the reproduction of their results complicated, if not
impossible.
In this thesis, we aim to contribute by first reviewing the different
approaches and techniques used by these models, and then by proposing our own
implementation of a text-to-image model. Largely based on DALL-E 2, our
implementation introduces several slight modifications to tackle the high
computational cost involved. This gives us the opportunity to experiment in
order to understand what these models are capable of, especially in a
low-resource regime. In particular, we provide additional analyses that go
deeper than those performed by the authors of DALL-E 2, including ablation
studies.
In addition, diffusion models use so-called guidance methods to steer the
generation process. We introduce a new guidance method that can be used in
conjunction with other guidance methods to improve image quality. Finally, the
images generated by our model are of reasonably good quality, without having to
sustain the significant training costs of state-of-the-art text-to-image
models.
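The abstract describes a DALL-E 2-style architecture but does not give implementation details, so the following is only a minimal, hypothetical sketch of such a two-stage pipeline: a prior maps a text embedding to an image embedding, and a diffusion decoder denoises Gaussian noise into an image conditioned on that embedding. The module names (text_encoder, prior, decoder) and the plain DDPM sampler are assumptions, not the thesis code.

```python
# Hypothetical sketch of a DALL-E 2-style two-stage text-to-image pipeline.
import torch

@torch.no_grad()
def ddpm_sample(decoder, img_emb, shape, betas):
    """Plain DDPM ancestral sampling, conditioned on an image embedding."""
    alphas = 1.0 - betas                       # (T,) noise schedule
    alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product over timesteps
    x = torch.randn(shape)                     # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = decoder(x, t_batch, img_emb)     # predicted noise epsilon_theta
        coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # variance choice sigma_t^2 = beta_t
    return x

def generate(text, text_encoder, prior, decoder, betas, image_shape):
    txt_emb = text_encoder(text)   # e.g. a CLIP-like text embedding
    img_emb = prior(txt_emb)       # prior: text embedding -> image embedding
    return ddpm_sample(decoder, img_emb, image_shape, betas)
```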
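The new guidance method itself is not specified in the abstract, so the sketch below only illustrates the standard way guidance signals are combined at each denoising step: classifier-free guidance mixes conditional and unconditional noise predictions, and a further (hypothetical) guidance term can be added on top with its own weight.

```python
# Sketch only: combining classifier-free guidance with an extra guidance signal.
import torch

def guided_epsilon(decoder, x, t, img_emb, cfg_scale=3.0,
                   extra_guidance=None, extra_scale=0.0):
    """Combine classifier-free guidance with an optional extra guidance term.

    `extra_guidance`, if given, is assumed to be a pre-scaled gradient-like
    tensor (e.g. from a classifier or CLIP model) with the same shape as x.
    """
    eps_cond = decoder(x, t, img_emb)  # conditional noise prediction
    eps_uncond = decoder(x, t, None)   # unconditional prediction (condition dropped)
    # Classifier-free guidance: extrapolate away from the unconditional prediction.
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    if extra_guidance is not None:
        # Guidance methods compose additively on the noise prediction; subtracting
        # a scaled log-likelihood gradient pushes samples toward the guided objective.
        eps = eps - extra_scale * extra_guidance
    return eps
```

In a sampler like the one sketched above, this function would simply replace the direct call to `decoder(x, t_batch, img_emb)`.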
Related papers
- YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how choices of model and training dataset sizes affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z) - Direct Consistency Optimization for Compositional Text-to-Image
Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z) - Conditional Image Generation with Pretrained Generative Model [1.4685355149711303]
Diffusion models have gained popularity for their ability to generate higher-quality images in comparison to GAN models.
These models require a huge amount of data, computational resources, and meticulous tuning for successful training.
We propose methods to leverage pre-trained unconditional diffusion models with additional guidance for the purpose of conditional image generation.
arXiv Detail & Related papers (2023-12-20T18:27:53Z) - LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image
Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z) - Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z) - GLIDE: Towards Photorealistic Image Generation and Editing with
Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z) - InvGAN: Invertible GANs [88.58338626299837]
InvGAN, short for Invertible GAN, successfully embeds real images into the latent space of a high-quality generative model.
This allows us to perform image inpainting, merging, and online data augmentation.
arXiv Detail & Related papers (2021-12-08T21:39:00Z) - Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework in which a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.