Prompt-to-Prompt Image Editing with Cross Attention Control
- URL: http://arxiv.org/abs/2208.01626v1
- Date: Tue, 2 Aug 2022 17:55:41 GMT
- Title: Prompt-to-Prompt Image Editing with Cross Attention Control
- Authors: Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch,
Daniel Cohen-Or
- Abstract summary: We present an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only.
We show our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
- Score: 41.26939787978142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large-scale text-driven synthesis models have attracted much attention
thanks to their remarkable capabilities of generating highly diverse images
that follow given text prompts. Such text-based synthesis methods are
particularly appealing to humans who are accustomed to describing their intent
verbally. Therefore, it is only natural to extend text-driven image synthesis
to text-driven image editing. Editing is challenging for these generative
models, since an innate property of an editing technique is to preserve most of
the original image, while in the text-based models, even a small modification
of the text prompt often leads to a completely different outcome.
State-of-the-art methods mitigate this by requiring the users to provide a
spatial mask to localize the edit, hence, ignoring the original structure and
content within the masked region. In this paper, we pursue an intuitive
prompt-to-prompt editing framework, where the edits are controlled by text
only. To this end, we analyze a text-conditioned model in depth and observe
that the cross-attention layers are the key to controlling the relation between
the spatial layout of the image and each word in the prompt. With this
observation, we present several applications which monitor the image synthesis
by editing the textual prompt only. This includes localized editing by
replacing a word, global editing by adding a specification, and even delicately
controlling the extent to which a word is reflected in the image. We present
our results over diverse images and prompts, demonstrating high-quality
synthesis and fidelity to the edited prompts.
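Code sketch: a minimal, hedged illustration of the cross-attention control described above, written as a toy PyTorch cross-attention layer plus a controller that records the source prompt's attention maps and injects them while denoising the edited prompt. The class names, the record/edit two-pass scheme, and the swap-step cutoff are illustrative assumptions, not the paper's implementation.

# A minimal sketch of cross-attention control; shapes, names, and the two-pass
# record/edit scheme are illustrative assumptions, not the authors' code.
import torch


class CrossAttention(torch.nn.Module):
    """Toy cross-attention: image tokens (queries) attend to prompt tokens (keys/values)."""

    def __init__(self, dim, txt_dim):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(txt_dim, dim, bias=False)
        self.to_v = torch.nn.Linear(txt_dim, dim, bias=False)
        self.controller = None  # hook set externally to edit the attention maps

    def forward(self, x, txt):
        q, k, v = self.to_q(x), self.to_k(txt), self.to_v(txt)
        attn = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
        attn = attn.softmax(dim=-1)           # (batch, image_tokens, prompt_tokens)
        if self.controller is not None:
            attn = self.controller(attn)      # swap / re-weight the maps here
        return attn @ v


class PromptToPromptController:
    """Records source-prompt attention maps and injects them while denoising the edited prompt."""

    def __init__(self, swap_steps, token_weights=None):
        self.swap_steps = swap_steps          # inject source maps for the first `swap_steps` steps
        self.token_weights = token_weights    # optional per-token scaling, e.g. {5: 2.0}
        self.mode = "record"                  # "record" on the source pass, "edit" on the edited pass
        self.step = 0
        self.saved = []                       # source maps saved at the current denoising step

    def __call__(self, attn):
        if self.mode == "record":             # source pass: just remember the maps
            self.saved.append(attn.detach())
            return attn
        if self.step < self.swap_steps and self.saved:
            attn = self.saved.pop(0)          # edited pass: overwrite with the source maps early on
        if self.token_weights:                # re-weighting: amplify or attenuate chosen words
            attn = attn.clone()
            for tok, w in self.token_weights.items():
                attn[..., tok] = attn[..., tok] * w
        return attn

In a denoising loop one would, at each step, run the network on the source prompt with controller.mode = "record", rerun it on the edited prompt (same token count for a word swap) with controller.mode = "edit", then clear controller.saved and advance controller.step; adding a specification or re-weighting a single word uses the same hook.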
Related papers
- An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control [21.624984690721842]
D-Edit is a framework to disentangle the comprehensive image-prompt interaction into several item-prompt interactions.
It builds on pretrained diffusion models with disentangled cross-attention layers and adopts a two-step optimization to build item-prompt associations.
We demonstrate state-of-the-art results in four types of editing operations including image-based, text-based, mask-based editing, and item removal.
arXiv Detail & Related papers (2024-03-07T20:06:29Z)
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z)
- FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework [19.564048493848272]
Scene Text Editing (STE) is a challenging research problem that primarily aims to modify existing text in an image.
Existing style-transfer-based approaches have shown sub-par editing performance due to complex image backgrounds, diverse font attributes, and varying word lengths within the text.
We propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations.
arXiv Detail & Related papers (2023-08-05T15:54:06Z)
- Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models [6.34777393532937]
We propose an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing.
Our proposed editing method consists of a reconstruction stage and an editing stage.
Experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-08T03:34:33Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built by fine-tuning on text-guided image inpainting.
Its edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training.
To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited; a rough code sketch of this mask-estimation idea appears after the list below.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
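Code sketch for the DiffEdit entry above: a hedged illustration of how an edit mask can be estimated automatically by contrasting noise predictions under the original and the edited text conditioning. The denoiser signature, the simplified noising, and the 0.5 threshold are assumptions for illustration, not the paper's exact procedure.

# Hedged sketch of DiffEdit-style automatic mask estimation: add noise to the
# input, denoise under the reference and the query prompts, and threshold the
# averaged difference of the noise estimates. The `denoiser` signature, the
# simplified noising, and the threshold are illustrative assumptions.
import torch


@torch.no_grad()
def estimate_edit_mask(denoiser, latent, t, ref_emb, query_emb,
                       n_samples=8, threshold=0.5):
    diffs = []
    for _ in range(n_samples):
        noisy = latent + torch.randn_like(latent)          # simplified forward noising at step t
        eps_ref = denoiser(noisy, t, ref_emb)              # noise estimate for the original prompt
        eps_query = denoiser(noisy, t, query_emb)          # noise estimate for the edit prompt
        diffs.append((eps_query - eps_ref).abs().mean(dim=1, keepdim=True))
    diff = torch.stack(diffs).mean(dim=0)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)  # normalize to [0, 1]
    return (diff > threshold).float()                      # 1 where the image should change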