Get What You Want, Not What You Don't: Image Content Suppression for
Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2402.05375v1
- Date: Thu, 8 Feb 2024 03:15:06 GMT
- Title: Get What You Want, Not What You Don't: Image Content Suppression for
Text-to-Image Diffusion Models
- Authors: Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin
Hou, Yaxing Wang, Jian Yang
- Abstract summary: We analyze how to manipulate the text embeddings and remove unwanted content from them.
The first regularizes the text embedding matrix and effectively suppresses the undesired content.
The second method further suppresses generation of the prompt's unwanted content while encouraging generation of the desired content.
- Score: 86.92711729969488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of recent text-to-image diffusion models is largely due to their
capacity to be guided by a complex text prompt, which enables users to
precisely describe the desired content. However, these models struggle to
effectively suppress the generation of undesired content, which is explicitly
requested to be omitted from the generated image in the prompt. In this paper,
we analyze how to manipulate the text embeddings and remove unwanted content
from them. We introduce two contributions, which we refer to as
soft-weighted regularization and inference-time text embedding
optimization. The first regularizes the text embedding matrix and
effectively suppresses the undesired content. The second method further
suppresses generation of the prompt's unwanted content while encouraging
generation of the desired content. We evaluate our method quantitatively and
qualitatively through extensive experiments, validating its effectiveness.
Furthermore, our method generalizes to both pixel-space diffusion models
(i.e., DeepFloyd-IF) and latent-space diffusion models (i.e., Stable
Diffusion).
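As a concrete illustration of the first contribution, the sketch below regularizes the slice of a text embedding matrix that carries an unwanted token by shrinking its singular values. The 77x768 CLIP ViT-L/14 embedding shape and the exponential down-weighting are illustrative assumptions, not the paper's exact soft-weighting rule.

```python
# Hypothetical sketch of soft-weighted regularization on a text embedding matrix.
# The token index and the exp(-sigma) down-weighting are illustrative assumptions.
import torch

def suppress_token(text_emb: torch.Tensor, neg_idx: list) -> torch.Tensor:
    """text_emb: [seq_len, dim] text embeddings for one prompt.
    neg_idx: positions of the unwanted ("negative target") tokens."""
    sub = text_emb[neg_idx]                      # rows carrying the unwanted content
    U, S, Vh = torch.linalg.svd(sub, full_matrices=False)
    S_soft = S * torch.exp(-S)                   # soft-weight: shrink dominant directions
    sub_reg = U @ torch.diag(S_soft) @ Vh        # rebuild the regularized sub-matrix
    out = text_emb.clone()
    out[neg_idx] = sub_reg
    return out

# Example: suppress the 5th token of a 77x768 embedding (CLIP ViT-L/14 shape).
emb = torch.randn(77, 768)
emb_clean = suppress_token(emb, neg_idx=[5])
print(emb_clean.shape)  # torch.Size([77, 768])
```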
Related papers
- Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing [4.948910649137149]
Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation.
We investigate how text and image latents individually and jointly contribute to the semantics of generated images.
We propose a simple and effective Extract-Manipulate-Sample framework for zero-shot fine-grained image editing.
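A rough sketch of how such an extract-manipulate-sample edit might look in latent space is given below; the mean-difference direction and the edit strength are assumptions for illustration, not the paper's actual procedure.

```python
# Hedged sketch of extract-manipulate-sample style editing in latent space.
import torch

def extract_direction(latents_with_attr, latents_without_attr):
    # "Extract": estimate a semantic direction as a mean difference (an assumption).
    return latents_with_attr.mean(0) - latents_without_attr.mean(0)

def manipulate(latent, direction, strength=1.5):
    # "Manipulate": shift the image latent along the direction; it is then sampled/decoded.
    return latent + strength * direction

with_attr = torch.randn(16, 4, 64, 64)     # latents of images that have the attribute
without_attr = torch.randn(16, 4, 64, 64)  # latents of images that lack it
edited = manipulate(torch.randn(4, 64, 64), extract_direction(with_attr, without_attr))
print(edited.shape)  # torch.Size([4, 64, 64])
```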
arXiv Detail & Related papers (2024-08-23T19:00:52Z)
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamic fine-control prompts.
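The sketch below shows one plausible way to represent such dynamic fine-control prompts, giving each word a weight and a timestep window; the data structure and gating rule are assumptions, not PAE's actual policy.

```python
# Illustrative data structure for per-word weights and injection time windows.
import torch

def dynamic_condition(token_embs, controls, t):
    """token_embs: [n_tokens, dim]; controls: list of (weight, t_start, t_end)."""
    cond = torch.zeros_like(token_embs)
    for i, (w, t0, t1) in enumerate(controls):
        if t0 <= t <= t1:                 # inject this word only inside its window
            cond[i] = w * token_embs[i]   # scaled by its learned weight
    return cond

embs = torch.randn(3, 768)
controls = [(1.0, 0, 1000), (1.4, 0, 500), (0.6, 500, 1000)]
print(dynamic_condition(embs, controls, t=250)[2].abs().sum())  # third word inactive at t=250
```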
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference-stage refinement process, we achieve notably high sequence accuracy when synthesizing text in arbitrary images.
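A lightweight character-level encoder of the kind the summary describes could look roughly like the sketch below; the vocabulary, dimensions, and layer counts are assumptions, not UDiffText's actual architecture.

```python
# Hypothetical character-level text encoder: character embeddings + a small Transformer.
import torch
import torch.nn as nn

class CharTextEncoder(nn.Module):
    def __init__(self, vocab_size=128, dim=256, n_layers=2, n_heads=4, max_len=32):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text: str) -> torch.Tensor:
        ids = torch.tensor([[min(ord(c), 127) for c in text]])  # ASCII codepoints
        pos = torch.arange(ids.shape[1]).unsqueeze(0)
        x = self.char_emb(ids) + self.pos_emb(pos)
        return self.encoder(x)  # [1, len(text), dim] per-character features

enc = CharTextEncoder()
print(enc("HELLO").shape)  # torch.Size([1, 5, 256])
```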
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
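The sketch below shows the general shape of re-weighting per-token contributions inside cross-attention with a mask derived from the attention maps; the way the mask is computed here is a guess for illustration, not MaskDiffusion's actual conditioning.

```python
# Illustrative adaptive per-token mask applied to cross-attention weights.
import torch
import torch.nn.functional as F

def masked_cross_attention(q, k, v):
    """q: [pixels, d]; k, v: [tokens, d]."""
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # [pixels, tokens]
    token_strength = attn.mean(dim=0)                       # how strongly each token is used
    mask = torch.sigmoid((token_strength - token_strength.mean()) * 10.0)  # assumed rule
    attn = attn * mask                                      # re-weight token contributions
    attn = attn / attn.sum(dim=-1, keepdim=True)            # renormalize
    return attn @ v

out = masked_cross_attention(torch.randn(64, 320), torch.randn(77, 320), torch.randn(77, 320))
print(out.shape)  # torch.Size([64, 320])
```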
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection [53.320946030761796]
Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
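A generate-then-select pipeline of this kind can be sketched with off-the-shelf components, as below; the model checkpoints and the CLIP-similarity scoring are reasonable stand-ins, not necessarily the paper's exact setup.

```python
# Candidate generation + automatic selection, sketched with Stable Diffusion and CLIP.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a red cube on top of a blue sphere"
candidates = pipe(prompt, num_images_per_prompt=4).images       # several candidate images

inputs = proc(text=[prompt], images=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = clip(**inputs).logits_per_image.squeeze(1)          # image-text similarity
best = candidates[scores.argmax().item()]                        # keep the most faithful one
best.save("best.png")
```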
arXiv Detail & Related papers (2023-05-22T17:59:41Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
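One way to picture cross-attention guidance is as a small gradient step that pulls the edited pass's attention maps toward reference maps recorded from the input image, as in the hedged sketch below; the loss and guidance scale are illustrative.

```python
# Toy cross-attention guidance step: penalize deviation from reference attention maps
# and nudge the latent with the gradient (scale and loss are assumptions).
import torch

def attention_guidance(latent, ref_maps, edit_maps, scale=0.1):
    """latent: tensor with requires_grad; *_maps: [pixels, tokens] cross-attention maps."""
    loss = ((edit_maps - ref_maps) ** 2).mean()   # keep the structure of the input image
    (grad,) = torch.autograd.grad(loss, latent)
    return latent.detach() - scale * grad         # one guidance step on the latent

latent = torch.randn(4, 64, 64, requires_grad=True)
edit_maps = latent.mean() * torch.randn(4096, 77)  # placeholder maps that depend on the latent
ref_maps = torch.randn(4096, 77)
print(attention_guidance(latent, ref_maps, edit_maps).shape)  # torch.Size([4, 64, 64])
```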
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
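The routing idea (one specialized denoiser per band of timesteps, so only a single expert runs at each step) can be sketched as follows; the two-expert split and the toy convolutional stand-ins are assumptions for illustration.

```python
# Minimal timestep-routed ensemble: exactly one expert denoiser runs per step.
import torch
import torch.nn as nn

class ExpertEnsemble(nn.Module):
    def __init__(self, dim=4, split=500):
        super().__init__()
        self.high_noise = nn.Conv2d(dim, dim, 3, padding=1)  # expert for early, noisy steps
        self.low_noise = nn.Conv2d(dim, dim, 3, padding=1)   # expert for late, detail steps
        self.split = split

    def forward(self, x, t):
        expert = self.high_noise if t >= self.split else self.low_noise
        return expert(x)  # inference cost matches a single model

model = ExpertEnsemble()
x = torch.randn(1, 4, 64, 64)
print(model(x, t=800).shape, model(x, t=100).shape)
```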
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Text-Guided Neural Image Inpainting [20.551488941041256]
The inpainting task requires filling the corrupted image with content coherent with the context.
The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text.
We propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet).
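At its simplest, text-guided inpainting composes generated content into the corrupted region only, as in the toy sketch below; the placeholder network is not TDANet's dual-attention architecture.

```python
# Toy text-guided inpainting: generate content conditioned on a text feature and
# paste it only into the masked (corrupted) region.
import torch
import torch.nn as nn

class ToyTextGuidedInpainter(nn.Module):
    def __init__(self, text_dim=256):
        super().__init__()
        self.to_rgb = nn.Conv2d(3 + 1, 3, 3, padding=1)  # image channels + mask channel
        self.text_proj = nn.Linear(text_dim, 3)

    def forward(self, image, mask, text_feat):
        x = torch.cat([image * (1 - mask), mask], dim=1)           # corrupted input + mask
        gen = self.to_rgb(x) + self.text_proj(text_feat)[..., None, None]
        return image * (1 - mask) + gen * mask                     # fill only the hole

net = ToyTextGuidedInpainter()
out = net(torch.rand(1, 3, 64, 64), torch.zeros(1, 1, 64, 64), torch.randn(1, 256))
print(out.shape)  # torch.Size([1, 3, 64, 64])
```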
arXiv Detail & Related papers (2020-04-07T09:04:43Z)