GLIDE: Towards Photorealistic Image Generation and Editing with
Text-Guided Diffusion Models
- URL: http://arxiv.org/abs/2112.10741v2
- Date: Wed, 22 Dec 2021 18:39:39 GMT
- Title: GLIDE: Towards Photorealistic Image Generation and Editing with
Text-Guided Diffusion Models
- Authors: Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela
Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen
- Abstract summary: We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
- Score: 16.786221846896108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have recently been shown to generate high-quality synthetic
images, especially when paired with a guidance technique to trade off diversity
for fidelity. We explore diffusion models for the problem of text-conditional
image synthesis and compare two different guidance strategies: CLIP guidance
and classifier-free guidance. We find that the latter is preferred by human
evaluators for both photorealism and caption similarity, and often produces
photorealistic samples. Samples from a 3.5 billion parameter text-conditional
diffusion model using classifier-free guidance are favored by human evaluators
to those from DALL-E, even when the latter uses expensive CLIP reranking.
Additionally, we find that our models can be fine-tuned to perform image
inpainting, enabling powerful text-driven image editing. We train a smaller
model on a filtered dataset and release the code and weights at
https://github.com/openai/glide-text2im.
Related papers
- Direct Consistency Optimization for Compositional Text-to-Image
Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z) - DreamDistribution: Prompt Distribution Learning for Text-to-Image
Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z) - UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z) - Diversified in-domain synthesis with efficient fine-tuning for few-shot
classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z) - More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.