Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering
- URL: http://arxiv.org/abs/2401.06345v1
- Date: Fri, 12 Jan 2024 03:46:29 GMT
- Title: Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering
- Authors: Chang Yu, Junran Peng, Xiangyu Zhu, Zhaoxiang Zhang, Qi Tian, Zhen Lei
- Abstract summary: We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method effectively learns prompts that improve the match between the input text and the generated images.
- Score: 118.53208190209517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis with diffusion models has recently shown
remarkable performance in generating high-quality images. Although these
models perform well for simple texts, they may become confused when faced with
complex texts that contain multiple objects or spatial relationships. A
feasible way to obtain the desired images is to manually adjust the textual
descriptions, i.e., rephrasing the text or adding some words, which is
labor-intensive. In this paper, we propose a framework that learns proper
textual descriptions for diffusion models through prompt learning. By
utilizing quality guidance and semantic guidance derived from the pre-trained
diffusion model, our method effectively learns prompts that improve the match
between the input text and the generated images. Extensive experiments and
analyses validate the effectiveness of the proposed method.
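The abstract only sketches how the prompts are learned. Below is a minimal, hypothetical PyTorch sketch of the general idea, not the authors' implementation: a few learnable embedding vectors (the "incantation") are appended to the frozen text encoding and optimized against two frozen scoring modules that stand in for the paper's quality guidance and semantic guidance. All module names, shapes, and the toy scorers are assumptions for illustration.

```python
# Minimal sketch (not the paper's code): learn extra prompt embeddings under
# frozen guidance signals. The guidance modules here are toy stand-ins; in the
# paper the signals are derived from the pre-trained diffusion model itself.
import torch
import torch.nn as nn

class FrozenTextEncoder(nn.Module):
    """Stand-in for a frozen pre-trained text encoder (e.g. a CLIP text tower)."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, token_ids):
        return self.emb(token_ids)  # (batch, tokens, dim)

class FrozenGuidance(nn.Module):
    """Stand-in for a guidance score (semantic match or image quality); frozen."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, 1)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, cond):
        return self.proj(cond.mean(dim=1)).mean()  # scalar, higher is better

text_encoder = FrozenTextEncoder()
semantic_guidance = FrozenGuidance()  # hypothetical text-image alignment score
quality_guidance = FrozenGuidance()   # hypothetical image quality score

token_ids = torch.randint(0, 1000, (1, 8))                   # user's original prompt
learned_prompt = nn.Parameter(0.02 * torch.randn(1, 4, 64))  # K learnable "incantation" tokens
optimizer = torch.optim.Adam([learned_prompt], lr=1e-2)

for step in range(100):
    # Condition = frozen text embedding followed by the learnable tokens.
    cond = torch.cat([text_encoder(token_ids), learned_prompt], dim=1)
    # Maximize both guidance scores by minimizing their negatives.
    loss = -(semantic_guidance(cond) + quality_guidance(cond))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Only the appended embeddings are optimized; the text encoder and the guidance networks stay frozen, which is what makes this kind of prompt learning practical on top of an existing diffusion model.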
Related papers
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST that focuses on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z) - Coherent Zero-Shot Visual Instruction Generation [15.0521272616551]
This paper introduces a simple, training-free framework to tackle the issues of generating visual instructions.
Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing.
Our experiments show that our approach can visualize coherent and visually pleasing instructions.
arXiv Detail & Related papers (2024-06-06T17:59:44Z) - UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z) - Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z) - SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z) - WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models [8.334487584550185]
We present a latent diffusion-based method for word-level generation of styled handwritten text images from given text content.
Our proposed method is able to generate realistic word image samples in different writer styles.
We show that the proposed model produces samples that are aesthetically pleasing, help boost text recognition performance, and achieve writer retrieval scores similar to real data.
arXiv Detail & Related papers (2023-03-29T10:19:26Z) - Text-to-image Diffusion Models in Generative AI: A Survey [75.32882187215394]
We present a review of state-of-the-art methods on text-conditioned image synthesis, i.e., text-to-image.
We discuss applications beyond text-to-image generation: text-guided creative generation and text-guided image editing.
arXiv Detail & Related papers (2023-03-14T13:49:54Z) - Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models [0.0]
We propose an optimization-free and zero fine-tuning framework that applies complex and non-rigid edits to a single real image via a text prompt.
We demonstrate our method's efficacy in producing high-quality, diverse, semantically coherent, and faithful real image edits.
arXiv Detail & Related papers (2022-11-15T01:07:38Z) - More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
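The semantic diffusion guidance entry above describes steering sampling with language or image guidance. Below is a hedged sketch of the common gradient-based guidance pattern that such methods build on; the denoiser and similarity function are toy stand-ins, and the names and signatures are illustrative rather than the paper's API.

```python
# Hedged sketch of gradient-based semantic guidance during sampling: at each
# reverse-diffusion step, the noise prediction is nudged by the gradient of a
# similarity between the current sample and the guidance signal (text or image).
import torch

def toy_denoiser(x, t):
    # Stand-in for a frozen diffusion denoiser; predicts noise from x at step t.
    return 0.1 * x

def toy_similarity(x, guidance):
    # Stand-in for a CLIP-style score between the sample and the guidance signal.
    return -(x - guidance).pow(2).mean()

def guided_step(x_t, t, guidance, scale=1.0, step_size=0.1):
    x = x_t.detach().requires_grad_(True)
    eps = toy_denoiser(x, t)
    grad = torch.autograd.grad(toy_similarity(x, guidance), x)[0]
    eps_guided = eps - scale * grad  # shift the prediction toward the guidance
    return (x - step_size * eps_guided).detach()

x = torch.randn(1, 3, 8, 8)       # noisy sample
target = torch.zeros(1, 3, 8, 8)  # toy guidance signal (image, or an embedded text prompt)
for t in reversed(range(10)):
    x = guided_step(x, t, target)
```

Because the guidance enters only through a gradient added at sampling time, the same loop can accept a language score, an image-similarity score, or a weighted sum of both, which matches the "either language or image guidance, or both" framing of that entry.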
This list is automatically generated from the titles and abstracts of the papers in this site.