eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers
- URL: http://arxiv.org/abs/2211.01324v1
- Date: Wed, 2 Nov 2022 17:43:04 GMT
- Title: eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers
- Authors: Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song,
Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero
Karras, Ming-Yu Liu
- Abstract summary: Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
- Score: 87.52504764677226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale diffusion-based generative models have led to breakthroughs in
text-conditioned high-resolution image synthesis. Starting from random noise,
such text-to-image diffusion models gradually synthesize images in an iterative
fashion while conditioning on text prompts. We find that their synthesis
behavior qualitatively changes throughout this process: Early in sampling,
generation strongly relies on the text prompt to generate text-aligned content,
while later, the text conditioning is almost entirely ignored. This suggests
that sharing model parameters throughout the entire generation process may not
be ideal. Therefore, in contrast to existing works, we propose to train an
ensemble of text-to-image diffusion models specialized for different synthesis
stages. To maintain training efficiency, we initially train a single model,
which is then split into specialized models that are trained for the specific
stages of the iterative generation process. Our ensemble of diffusion models,
called eDiffi, results in improved text alignment while maintaining the same
inference computation cost and preserving high visual quality, outperforming
previous large-scale text-to-image diffusion models on the standard benchmark.
In addition, we train our model to exploit a variety of embeddings for
conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We
show that these different embeddings lead to different behaviors. Notably, the
CLIP image embedding allows an intuitive way of transferring the style of a
reference image to the target text-to-image output. Lastly, we show a technique
that enables eDiffi's "paint-with-words" capability. A user can select words
in the input text and paint them on a canvas to control the output, which is
handy for crafting the desired image they have in mind. The project page is available at
https://deepimagination.cc/eDiffi/
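A minimal sketch of the ensemble-of-experts idea described in the abstract: each denoising step is routed to a denoiser specialized for that portion of the noise schedule. This is an illustrative toy, not the authors' implementation; the two-expert split, the class and function names, the 50-step schedule, and the simplified update rule are all assumptions.

    import torch

    NUM_STEPS = 50  # assumed toy schedule length

    class ExpertEnsemble:
        """Route each denoising step to the expert owning that noise interval."""

        def __init__(self, experts):
            # experts: list of (frac_low, frac_high, denoise_fn) covering [0, 1).
            self.experts = experts

        def __call__(self, x_t, t, text_emb):
            frac = t / NUM_STEPS  # near 1.0 ~ high noise, near 0.0 ~ almost clean
            for frac_low, frac_high, denoise_fn in self.experts:
                if frac_low <= frac < frac_high:
                    return denoise_fn(x_t, t, text_emb)
            return self.experts[-1][2](x_t, t, text_emb)

    # Toy stand-ins for trained experts. In the paper's scheme, both would be
    # initialized from one shared model and then trained on disjoint
    # noise-level ranges; here they are dummies so the sketch runs.
    def high_noise_expert(x_t, t, text_emb):
        return 0.9 * x_t + 0.1 * text_emb.mean()  # early steps lean on the text condition

    def low_noise_expert(x_t, t, text_emb):
        return x_t  # late steps mostly refine visual detail

    ensemble = ExpertEnsemble([
        (0.5, 1.0, high_noise_expert),  # early, high-noise stage
        (0.0, 0.5, low_noise_expert),   # late, low-noise stage
    ])

    # Simplified sampling loop; a real sampler would apply a proper
    # DDPM/DDIM update instead of this placeholder step.
    x = torch.randn(1, 3, 64, 64)
    text_emb = torch.randn(1, 512)  # stand-in for a T5 or CLIP text embedding
    for t in reversed(range(NUM_STEPS)):
        eps_hat = ensemble(x, t, text_emb)
        x = x - eps_hat / NUM_STEPS

In the same spirit, the T5 text, CLIP text, and CLIP image embeddings mentioned in the abstract would simply be additional conditioning inputs passed to whichever expert handles the current step.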
Related papers
- Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv Detail & Related papers (2024-02-26T15:01:16Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models [77.03361270726944]
Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
arXiv Detail & Related papers (2023-05-25T16:32:01Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to recent diffusion models.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
- Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models [43.32978092618245]
We present a novel neural pipeline for generating a coherent storybook from the plain text of a story.
We leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images.
arXiv Detail & Related papers (2023-02-08T06:24:06Z)
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [26.748667878221568]
We present a new approach for "personalization" of text-to-image models.
We fine-tune a pretrained text-to-image model to bind a unique identifier with that specific subject.
The unique identifier can then be used to synthesize novel photorealistic images of the subject contextualized in different scenes.
arXiv Detail & Related papers (2022-08-25T17:45:49Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two guidance strategies: CLIP guidance and classifier-free guidance (sketched after this list).
We find that classifier-free guidance is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
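The GLIDE entry above contrasts CLIP guidance with classifier-free guidance. Below is a minimal sketch of the classifier-free combination of conditional and unconditional noise predictions; the toy denoiser and the guidance weight of 3.0 are assumptions for illustration, not GLIDE's code.

    import torch

    def classifier_free_guidance(denoiser, x_t, t, text_emb, guidance_weight=3.0):
        """Blend conditional and unconditional predictions into one guided estimate."""
        eps_cond = denoiser(x_t, t, text_emb)  # text-conditioned prediction
        eps_uncond = denoiser(x_t, t, None)    # prediction with the condition dropped
        return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

    # Dummy denoiser so the sketch runs end to end.
    def toy_denoiser(x_t, t, text_emb):
        base = 0.1 * x_t
        return base if text_emb is None else base + text_emb.mean()

    x = torch.randn(1, 3, 64, 64)
    text_emb = torch.randn(1, 512)
    eps_hat = classifier_free_guidance(toy_denoiser, x, t=10, text_emb=text_emb)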