Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion
Models
- URL: http://arxiv.org/abs/2305.16223v2
- Date: Thu, 1 Jun 2023 02:27:42 GMT
- Title: Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion
Models
- Authors: Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa,
Humphrey Shi
- Abstract summary: Text-to-image (T2I) research has grown explosively in the past year.
One pain point persists: text prompt engineering, and searching for high-quality text prompts that yield customized results is more art than science.
In this paper, we take "Text" out of a pre-trained T2I diffusion model to reduce the burdensome prompt engineering efforts for users.
- Score: 94.25020178662392
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) research has grown explosively in the past year, owing to
the large-scale pre-trained diffusion models and many emerging personalization
and editing approaches. Yet, one pain point persists: text prompt
engineering, and searching for high-quality text prompts that yield customized
results is more art than science. Moreover, as commonly argued: "an image is worth a
thousand words" - the attempt to describe a desired image with text often ends
up ambiguous and cannot comprehensively cover delicate visual details,
hence necessitating additional controls from the visual domain. In this
paper, we take a bold step forward: taking "Text" out of a pre-trained T2I
diffusion model, to reduce the burdensome prompt engineering efforts for users.
Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to
generate new images: it takes a reference image as "context", an optional image
structural conditioning, and an initial noise, with absolutely no text prompt.
The core architecture behind the scenes is the Semantic Context Encoder (SeeCoder),
which substitutes for the commonly used CLIP-based or LLM-based text encoder. The
reusability of SeeCoder also makes it a convenient drop-in component: one can
pre-train a SeeCoder with one T2I model and reuse it for another. Through
extensive experiments, Prompt-Free Diffusion is found to (i)
outperform prior exemplar-based image synthesis approaches; (ii) perform on par
with state-of-the-art T2I models using prompts following the best practice; and
(iii) be naturally extensible to other downstream applications such as anime
figure generation and virtual try-on, with promising quality. Our code and
models are open-sourced at https://github.com/SHI-Labs/Prompt-Free-Diffusion.
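To make the SeeCoder idea concrete, the sketch below shows, under loose assumptions, how a visual encoder can emit context tokens shaped like text-encoder output, so a diffusion U-Net's cross-attention can consume them unchanged. All names and tensor sizes here (SeeCoderSketch, CrossAttentionBlock, the toy patch backbone) are illustrative placeholders, not the API of the open-sourced SHI-Labs code.

# Minimal, self-contained sketch (assumed/simplified, not the official
# Prompt-Free Diffusion implementation): a reference image is mapped to a
# (B, N, D) token sequence that plays the same role as CLIP text embeddings.
import torch
import torch.nn as nn


class SeeCoderSketch(nn.Module):
    """Toy stand-in for the Semantic Context Encoder: image -> context tokens."""

    def __init__(self, embed_dim: int = 768, patch: int = 16):
        super().__init__()
        # Patchify the reference image and project each patch to the
        # embedding width expected by the U-Net's cross-attention.
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:  # (B, 3, H, W)
        tokens = self.patchify(image).flatten(2).transpose(1, 2)  # (B, N, D)
        return self.proj(tokens)


class CrossAttentionBlock(nn.Module):
    """One U-Net cross-attention block; it does not care whether the context
    tokens came from a text encoder or from a visual encoder."""

    def __init__(self, embed_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, latent_tokens: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query=latent_tokens, key=context, value=context)
        return latent_tokens + out  # residual connection


if __name__ == "__main__":
    seecoder = SeeCoderSketch()
    block = CrossAttentionBlock()
    reference = torch.randn(1, 3, 512, 512)   # reference image used as "context"
    latents = torch.randn(1, 4096, 768)       # flattened 64x64 latent tokens
    context = seecoder(reference)             # visual tokens replace text tokens
    print(block(latents, context).shape)      # torch.Size([1, 4096, 768])

The point of the sketch is only that the cross-attention interface is indifferent to where the context tokens come from, which is what makes SeeCoder a drop-in substitute for a text encoder.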
Related papers
- Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines [33.49257838597258]
Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process.
We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations.
arXiv Detail & Related papers (2024-03-09T09:11:49Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference-stage refinement process, we achieve notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- De-Diffusion Makes Text a Strong Cross-Modal Interface [33.90004746543745]
We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
Experiments validate the precision and comprehensiveness of De-Diffusion text representing images.
A single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools.
arXiv Detail & Related papers (2023-11-01T16:12:40Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.