StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing
- URL: http://arxiv.org/abs/2303.15649v2
- Date: Sun, 20 Aug 2023 11:58:44 GMT
- Title: StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing
- Authors: Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin
Hou, Yaxing Wang, Jian Yang
- Abstract summary: Existing work exploits the capabilities of pretrained diffusion models for image editing, either by finetuning the model or by inverting the image in the latent space of the pretrained model.
These approaches suffer from two problems: unsatisfactory results in selected regions, and unexpected changes in non-selected regions.
- Score: 86.92711729969488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A significant research effort is focused on exploiting the impressive
capabilities of pretrained diffusion models for image editing. Existing approaches
either finetune the model or invert the image into the latent space of the
pretrained model. However, they suffer from two problems: (1) unsatisfactory
results in selected regions and unexpected changes in non-selected regions;
(2) they require careful text-prompt editing, where the prompt must include all
visual objects in the input image. To address this, we propose two improvements:
(1) optimizing only the input of the value linear network in the cross-attention
layers is sufficiently powerful to reconstruct a real image; (2) we propose
attention regularization to preserve the object-like attention maps after editing,
enabling accurate style editing without invoking significant structural changes.
We further improve the editing technique used for the unconditional branch of
classifier-free guidance, as well as the conditional branch as used by P2P.
Extensive prompt-editing experiments on a variety of images demonstrate, both
qualitatively and quantitatively, that our method has superior editing
capabilities compared to existing and concurrent works.
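To make the two proposed improvements more concrete, the following is a minimal, self-contained sketch (assuming PyTorch; the toy layer, tensor sizes, learning rate, and `attn_reg_weight` are illustrative and not taken from the paper). It freezes a toy cross-attention layer, optimizes only the embedding fed to its value projection while the keys stay driven by the frozen prompt embedding, and adds an attention-regularization term that keeps the attention maps close to an object-like reference.

```python
import torch
import torch.nn.functional as F

class CrossAttention(torch.nn.Module):
    """Toy single-head cross-attention layer standing in for a U-Net block."""
    def __init__(self, dim_q=320, dim_ctx=768, dim_head=64):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = torch.nn.Linear(dim_q, dim_head, bias=False)
        self.to_k = torch.nn.Linear(dim_ctx, dim_head, bias=False)
        self.to_v = torch.nn.Linear(dim_ctx, dim_head, bias=False)

    def forward(self, x, key_ctx, value_ctx):
        # Keys come from the frozen prompt embedding; values come from a
        # separate, learnable embedding -- improvement (1) in the abstract.
        q, k, v = self.to_q(x), self.to_k(key_ctx), self.to_v(value_ctx)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)
        return attn @ v, attn

torch.manual_seed(0)
layer = CrossAttention()
for p in layer.parameters():                  # the pretrained layer stays frozen
    p.requires_grad_(False)

prompt_emb = torch.randn(1, 77, 768)          # frozen CLIP-style text embedding
value_emb = prompt_emb.clone().requires_grad_(True)  # learnable value-side input
image_feats = torch.randn(1, 64, 320)         # stand-in for U-Net spatial features
recon_target = torch.randn(1, 64, 64)         # stand-in reconstruction target
with torch.no_grad():                         # reference ("object-like") attention maps
    _, ref_attn = layer(image_feats, prompt_emb, prompt_emb)

optimizer = torch.optim.Adam([value_emb], lr=1e-3)
attn_reg_weight = 0.1                         # illustrative weight
for step in range(200):
    out, attn = layer(image_feats, prompt_emb, value_emb)
    loss = F.mse_loss(out, recon_target)                        # reconstruction term
    loss = loss + attn_reg_weight * F.mse_loss(attn, ref_attn)  # attention regularization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The learnable `value_emb` above merely stands in for whatever is fed to the value linear network; the point of the sketch is the pattern of freezing the model, optimizing only the value-side input, and penalizing drift of the attention maps.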
Related papers
- ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models [11.830273909934688]
Modern Text-to-Image (T2I) Diffusion models have revolutionized image editing by enabling the generation of high-quality images.
We propose ReEdit, a modular and efficient end-to-end framework that captures edits in both text and image modalities.
Our results demonstrate that ReEdit consistently outperforms contemporary approaches both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-11-06T15:19:24Z)
- Vision-guided and Mask-enhanced Adaptive Denoising for Prompt-based Image Editing [67.96788532285649]
We present a Vision-guided and Mask-enhanced Adaptive Editing (ViMAEdit) method with three key novel designs.
First, we propose to leverage image embeddings as explicit guidance to enhance the conventional textual prompt-based denoising process.
Second, we devise a self-attention-guided iterative editing area grounding strategy.
arXiv Detail & Related papers (2024-10-14T13:41:37Z)
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to modify a given synthetic or real image to meet specific user requirements.
Significant recent advances in this field are based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- Zero-shot Image Editing with Reference Imitation [50.75310094611476]
We present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently.
We propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using information from the other frame (a rough sketch of this training scheme follows below).
We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives.
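As a rough illustration of the training scheme described above (the network, mask generator, and hyperparameters below are hypothetical stand-ins, not MimicBrush's actual architecture), the loop samples two frames from a clip, masks random regions of one, and trains a model to recover the masked regions using the other frame as reference.

```python
import torch
import torch.nn.functional as F

def random_mask(h=64, w=64, n_boxes=3):
    """Zero out a few random rectangles (1 = keep, 0 = masked). Illustrative only."""
    mask = torch.ones(1, 1, h, w)
    for _ in range(n_boxes):
        y, x = torch.randint(0, h // 2, (2,)).tolist()
        mask[..., y:y + h // 4, x:x + w // 4] = 0.0
    return mask

# Hypothetical stand-in for MimicBrush: any network mapping (masked frame, mask,
# reference frame) to a reconstruction of the masked frame.
model = torch.nn.Sequential(
    torch.nn.Conv2d(7, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

clip = torch.rand(16, 3, 64, 64)                    # toy "video clip" of 16 frames
for step in range(100):
    i, j = torch.randperm(len(clip))[:2].tolist()   # randomly pick two frames
    target, reference = clip[i:i + 1], clip[j:j + 1]
    mask = random_mask()
    masked = target * mask                          # hide some regions of one frame
    pred = model(torch.cat([masked, mask, reference], dim=1))
    loss = F.mse_loss(pred * (1 - mask), target * (1 - mask))  # recover masked regions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```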
arXiv Detail & Related papers (2024-06-11T17:59:51Z)
- BARET: Balanced Attention based Real image Editing driven by Target-text Inversion [36.59406959595952]
We propose a novel editing technique that only requires an input image and target text for various editing types, including non-rigid edits, without fine-tuning the diffusion model.
Our method contains three novelties: (I) a Target-text Inversion Schedule (TTIS), designed to fine-tune the input target-text embedding for fast image reconstruction without an image caption and with accelerated convergence; (II) a Progressive Transition Scheme, which applies progressive linear interpolation between the target-text embedding and its fine-tuned version to generate transition embeddings that maintain non-rigid editing capability (a minimal sketch of this follows below); (III) a Balanced Attention Module (BAM), which balances the trade-off between the textual description and image semantics.
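A minimal sketch of the Progressive Transition idea in (II), under the assumption that it reduces to linear interpolation between the fine-tuned (TTIS) embedding and the original target-text embedding, with the interpolation weight varied over the denoising schedule; the direction and schedule below are illustrative, not the paper's.

```python
import torch

def transition_embeddings(target_emb: torch.Tensor,
                          tuned_emb: torch.Tensor,
                          num_steps: int):
    """Yield one interpolated text embedding per denoising step.

    Early steps stay close to the fine-tuned (reconstruction-oriented) embedding;
    later steps move toward the original target text that carries the edit.
    """
    for t in range(num_steps):
        alpha = t / max(num_steps - 1, 1)            # 0 -> 1 over the schedule
        yield (1.0 - alpha) * tuned_emb + alpha * target_emb

# Toy usage with CLIP-sized embeddings (77 tokens x 768 dimensions).
target_emb = torch.randn(1, 77, 768)
tuned_emb = torch.randn(1, 77, 768)
for emb in transition_embeddings(target_emb, tuned_emb, num_steps=50):
    pass  # feed `emb` to the diffusion model's cross-attention at this step
```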
arXiv Detail & Related papers (2023-12-09T07:18:23Z)
- Optimisation-Based Multi-Modal Semantic Image Editing [58.496064583110694]
We propose an inference-time editing optimisation to accommodate multiple editing instruction types.
By allowing the influence of each loss function to be adjusted, we build a flexible editing solution that can be tailored to user preferences (a sketch of this weighted-loss pattern follows below).
We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits.
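The description above suggests an inference-time loop that minimizes a user-weighted combination of per-condition losses; the sketch below shows only that pattern (the loss terms, weights, and latent shape are placeholders, not the paper's).

```python
import torch

def combined_editing_loss(latent, losses, weights):
    """Weighted sum of editing losses: `losses` maps a condition name to a
    callable on the current latent, `weights` holds user-tunable influences."""
    return sum(weights[name] * fn(latent) for name, fn in losses.items())

# Placeholder loss terms for text, pose, and scribble conditions.
losses = {
    "text":     lambda z: z.pow(2).mean(),
    "pose":     lambda z: (z - 1.0).abs().mean(),
    "scribble": lambda z: z.sin().mean(),
}
weights = {"text": 1.0, "pose": 0.5, "scribble": 0.25}  # user preferences

latent = torch.randn(1, 4, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([latent], lr=1e-2)
for step in range(50):                                  # inference-time optimisation
    loss = combined_editing_loss(latent, losses, weights)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```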
arXiv Detail & Related papers (2023-11-28T15:31:11Z)
- Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models [6.34777393532937]
We propose an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing.
Our proposed editing method consists of a reconstruction stage and an editing stage.
Experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-08T03:34:33Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
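A rough sketch of the cross-attention guidance step, assuming a diffusers-style scheduler and a hypothetical `unet_with_attn` hook that returns both the noise prediction and the cross-attention maps; neither the hook nor the guidance weight comes from the paper.

```python
import torch
import torch.nn.functional as F

def guided_denoise_step(latent, t, edited_prompt_emb, ref_attn_maps,
                        unet_with_attn, scheduler, guidance_weight=0.1):
    """One denoising step that nudges the latent so its cross-attention maps
    stay close to the reference maps recorded while inverting the input image."""
    latent = latent.detach().requires_grad_(True)
    noise_pred, attn_maps = unet_with_attn(latent, t, edited_prompt_emb)
    # Penalize deviation of the current attention maps from the reference maps.
    attn_loss = sum(F.mse_loss(a, r) for a, r in zip(attn_maps, ref_attn_maps))
    grad = torch.autograd.grad(attn_loss, latent)[0]
    steered = latent - guidance_weight * grad        # keep the input image's structure
    return scheduler.step(noise_pred.detach(), t, steered.detach()).prev_sample
```

The reference maps would be recorded while reconstructing the input image with its original prompt; only the per-step guidance is shown here.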
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.