Dynamic Prompt Learning: Addressing Cross-Attention Leakage for
Text-Based Image Editing
- URL: http://arxiv.org/abs/2309.15664v1
- Date: Wed, 27 Sep 2023 13:55:57 GMT
- Title: Dynamic Prompt Learning: Addressing Cross-Attention Leakage for
Text-Based Image Editing
- Authors: Kai Wang, Fei Yang, Shiqi Yang, Muhammad Atif Butt, Joost van de
Weijer
- Abstract summary: We propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt.
We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
- Score: 23.00202969969574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale text-to-image generative models have been a ground-breaking
development in generative AI, with diffusion models showing their astounding
ability to synthesize convincing images following an input text prompt. The
goal of image editing research is to give users control over the generated
images by modifying the text prompt. Current image editing techniques are
susceptible to unintended modifications of regions outside the targeted area,
such as on the background or on distractor objects which have some semantic or
visual relationship with the targeted object. According to our experimental
findings, inaccurate cross-attention maps are at the root of this problem.
Based on this observation, we propose Dynamic Prompt Learning (DPL) to force
cross-attention maps to focus on correct noun words in the text prompt. By
updating the dynamic tokens for nouns in the textual input with the proposed
leakage repair losses, we achieve fine-grained image editing over
particular objects while preventing undesired changes to other image regions.
Our method DPL, based on the publicly available Stable Diffusion, is
extensively evaluated on a wide range of images, and consistently obtains
superior results both quantitatively (CLIP score, Structure-Dist) and
qualitatively (on user-evaluation). We show improved prompt editing results for
Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for
complex multi-object scenes.
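To make the optimization pattern concrete, below is a minimal PyTorch sketch of updating learnable noun-token embeddings so that their cross-attention maps separate cleanly. The random projection standing in for the frozen U-Net and the overlap/entropy penalties are assumptions for illustration, not DPL's actual losses.

```python
# Minimal sketch (all assumptions): the random projection below stands in for the frozen
# U-Net's cross-attention, and the overlap/entropy penalties are illustrative stand-ins,
# not the paper's actual leakage losses. Only the "dynamic" noun embeddings are trained.
import torch

torch.manual_seed(0)
HW, DIM = 16 * 16, 768                 # flattened 16x16 attention map, SD text-embedding size

image_feats = torch.randn(HW, DIM)     # frozen stand-in for spatial features of the image

# Learnable embeddings for two noun tokens, e.g. "cat" and "dog" in the prompt.
noun_tokens = torch.randn(2, DIM, requires_grad=True)
optimizer = torch.optim.Adam([noun_tokens], lr=1e-2)

def cross_attention(tokens):
    """Softmax attention of each token over spatial locations (illustrative only)."""
    logits = image_feats @ tokens.T / DIM ** 0.5       # (HW, num_tokens)
    return torch.softmax(logits, dim=0).T              # (num_tokens, HW), rows sum to 1

for step in range(200):
    maps = cross_attention(noun_tokens)
    overlap = (maps[0] * maps[1]).sum()                        # two nouns should not share pixels
    entropy = -(maps * (maps + 1e-8).log()).sum(dim=1).mean()  # flat, "leaky" maps are penalized
    loss = overlap + 0.1 * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```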
Related papers
- Text Guided Image Editing with Automatic Concept Locating and Forgetting [27.70615803908037]
We propose a novel method called Locate and Forget (LaF) to locate potential target concepts in the image for modification.
Compared to the baselines, our method demonstrates its superiority in text-guided image editing tasks both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-05-30T05:36:32Z)
- LocInv: Localization-aware Inversion for Text-Guided Image Editing [17.611103794346857]
Text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts.
Existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area.
We propose localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps.
arXiv Detail & Related papers (2024-05-02T17:27:04Z)
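A compact sketch of the prior-guided refinement LocInv describes, under the assumptions noted in the comments; the loss and the direct optimization of the map are simplifications rather than the paper's actual procedure.

```python
# Sketch (assumptions): pull a noun's cross-attention map toward a bounding-box prior.
# For brevity the map itself is optimized; in the actual method the gradient would flow
# into learnable token embeddings while the diffusion model stays frozen.
import torch
import torch.nn.functional as F

attn = torch.rand(16, 16, requires_grad=True)   # stand-in cross-attention map for one noun
prior = torch.zeros(16, 16)
prior[3:10, 5:14] = 1.0                          # bounding-box prior for that noun

def prior_alignment_loss(attn_map, prior_mask):
    """1 - cosine similarity between the attention map and the localization prior."""
    return 1.0 - F.cosine_similarity(attn_map.flatten(), prior_mask.flatten(), dim=0)

opt = torch.optim.Adam([attn], lr=0.1)
for _ in range(100):
    loss = prior_alignment_loss(attn.clamp(min=0), prior)
    opt.zero_grad()
    loss.backward()
    opt.step()
```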
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation, and especially in attribute-object binding, on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
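As a rough illustration of controlling attention with syntactic structure, the sketch below restricts an adjective's attention to where its head noun attends; the attribute-noun link is hard-coded (a parser would supply it in practice) and the masking rule is an assumed stand-in, not FCA's exact formulation.

```python
# Sketch (assumptions): an adjective's attention is focused onto the region where its
# syntactically linked noun attends, e.g. "yellow" -> "dog" in "a yellow dog and a cat".
import torch

attn = {                                   # stand-in cross-attention maps per token (8x8)
    "yellow": torch.rand(8, 8),
    "dog": torch.rand(8, 8),
    "cat": torch.rand(8, 8),
}
links = {"yellow": "dog"}                  # attribute -> head noun, from a dependency parse

def focus(attr_map, noun_map, keep_quantile=0.7):
    """Zero the attribute's attention outside the noun's most-attended locations, then renormalize."""
    thresh = torch.quantile(noun_map, keep_quantile)
    mask = (noun_map >= thresh).float()
    focused = attr_map * mask
    return focused / (focused.sum() + 1e-8)

for attribute, noun in links.items():
    attn[attribute] = focus(attn[attribute], attn[noun])
```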
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamically fine-controlled prompts.
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
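A toy epsilon-greedy sketch of the search over per-word weights and injection steps; the discrete action space and the stand-in reward are assumptions, and PAE's actual policy and reward signal are not reproduced here.

```python
# Toy sketch (assumptions): epsilon-greedy search over (weight, injection-step) choices
# for one prompt word; the reward below is a stand-in for an image-quality/CLIP signal.
import random

weights = [0.5, 1.0, 1.5, 2.0]
inject_from = [0, 10, 20, 30]              # denoising step at which the word starts to act
q_values = {(w, t): 0.0 for w in weights for t in inject_from}
counts = {a: 0 for a in q_values}

def fake_reward(weight, start_step):
    """Stand-in reward; a real system would score the generated image instead."""
    return -abs(weight - 1.5) - 0.01 * start_step + random.gauss(0, 0.1)

for episode in range(500):
    if random.random() < 0.1:                                  # explore
        action = random.choice(list(q_values))
    else:                                                      # exploit
        action = max(q_values, key=q_values.get)
    r = fake_reward(*action)
    counts[action] += 1
    q_values[action] += (r - q_values[action]) / counts[action]   # incremental mean

best_weight, best_step = max(q_values, key=q_values.get)
print(f"use weight {best_weight} injected from step {best_step}")
```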
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
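Schematically, such a mask can be obtained by contrasting the denoiser's noise estimates under the source and edit prompts; in the sketch below the prompts, the random stand-in for the U-Net, and the 0.5 threshold are all illustrative assumptions.

```python
# Sketch (assumptions): derive an edit mask by contrasting noise predictions for the
# source and edit prompts and thresholding the averaged difference. The random function
# below replaces a real Stable Diffusion U-Net call; the 0.5 threshold is arbitrary.
import torch

def predict_noise(latent, prompt):
    """Stand-in for the frozen diffusion model's text-conditional noise prediction."""
    torch.manual_seed(hash(prompt) % (2 ** 31))
    return 0.1 * latent + torch.randn_like(latent)

latent = torch.randn(4, 64, 64)                    # noised latent of the input image
diffs = []
for _ in range(10):                                # average over several noise samples
    noisy = latent + 0.5 * torch.randn_like(latent)
    delta = predict_noise(noisy, "a bowl of pears") - predict_noise(noisy, "a bowl of apples")
    diffs.append(delta.abs().mean(dim=0))          # collapse the channel dimension
diff = torch.stack(diffs).mean(dim=0)
diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
edit_mask = (diff > 0.5).float()                   # 1 where the image should change
```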
- Prompt-to-Prompt Image Editing with Cross Attention Control [41.26939787978142]
We present an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only.
We show our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
arXiv Detail & Related papers (2022-08-02T17:55:41Z)
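The core attention-injection idea can be sketched generically as below, assuming per-token cross-attention maps are available; the hook machinery of a real Stable Diffusion implementation and the exact injection schedule are omitted.

```python
# Sketch (assumptions): while denoising with the edited prompt, reuse the attention maps
# recorded for the source prompt for tokens the two prompts share; the recorded maps and
# the 0.8 injection ratio are illustrative stand-ins, not tied to a specific implementation.
import torch

num_steps = 50
tokens_src = ["a", "photo", "of", "a", "cat"]
tokens_edit = ["a", "photo", "of", "a", "dog"]

# Per-token cross-attention maps recorded while generating the source image (stand-ins).
cross_attn_src = {i: [torch.rand(16, 16) for _ in range(num_steps)]
                  for i in range(len(tokens_src))}

def edited_attention(step, token_idx, attn_edit, inject_until=0.8):
    """Swap in the source map for unchanged tokens during the first part of sampling."""
    same_word = tokens_src[token_idx] == tokens_edit[token_idx]
    if same_word and step < inject_until * num_steps:
        return cross_attn_src[token_idx][step]
    return attn_edit                         # the swapped word ("cat" -> "dog") keeps its own map

# Example: at step 10, token 4 differs between the prompts, so its new attention map is kept.
new_map = edited_attention(step=10, token_idx=4, attn_edit=torch.rand(16, 16))
```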
- Blended Diffusion for Text-driven Editing of Natural Images [18.664733153082146]
We introduce the first solution for performing local (region-based) edits in generic natural images.
We achieve our goal by leveraging and combining a pretrained language-image model (CLIP), to steer the edit towards a user-provided text prompt, with a denoising diffusion probabilistic model (DDPM) to generate natural-looking results.
To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent.
arXiv Detail & Related papers (2021-11-29T18:58:49Z)
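That blending step reduces to one line per denoising step, as in the sketch below; the mask, tensors, and update rule are stand-ins for the real CLIP-guided sampler.

```python
# Sketch (assumptions): per denoising step, keep the text-guided sample inside the user
# mask and a matching noised copy of the source image outside it. Tensors and the simple
# update rule stand in for a real CLIP-guided diffusion sampler.
import torch

H = W = 64
source = torch.rand(3, H, W)                       # original image (stand-in)
mask = torch.zeros(1, H, W)
mask[:, 20:44, 20:44] = 1.0                        # region the user wants edited

x = torch.randn(3, H, W)                           # current text-guided diffusion sample
num_steps = 50
for t in reversed(range(num_steps)):
    x = 0.98 * x + 0.02 * torch.randn_like(x)      # stand-in for one CLIP-guided denoising step
    noise_level = t / num_steps
    source_t = (1 - noise_level) * source + noise_level * torch.randn_like(source)
    # Spatial blending: edited content inside the mask, noised original outside.
    x = mask * x + (1 - mask) * source_t

edited = x      # after the last step, pixels outside the mask match the source image
```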
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.