Forgedit: Text Guided Image Editing via Learning and Forgetting
- URL: http://arxiv.org/abs/2309.10556v2
- Date: Sun, 17 Mar 2024 16:55:07 GMT
- Title: Forgedit: Text Guided Image Editing via Learning and Forgetting
- Authors: Shiwen Zhang, Shuai Xiao, Weilin Huang
- Abstract summary: We design a novel text-guided image editing method, named Forgedit.
First, we propose a vision-language joint optimization framework capable of reconstructing the original image in 30 seconds.
Then, we propose a novel vector projection mechanism in the text embedding space of Diffusion Models.
- Score: 17.26772361532044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-guided image editing on real or synthetic images, given only the original image itself and the target text prompt as inputs, is a very general and challenging task. It requires an editing model to estimate by itself which part of the image should be edited, and then perform either rigid or non-rigid editing while preserving the characteristics of the original image. In this paper, we design a novel text-guided image editing method, named Forgedit. First, we propose a vision-language joint optimization framework capable of reconstructing the original image in 30 seconds, much faster than the previous SOTA and with much less overfitting. Then we propose a novel vector projection mechanism in the text embedding space of Diffusion Models, which can control identity similarity and editing strength separately. Finally, we discovered a general property of the UNet in Diffusion Models: the UNet encoder learns space and structure, while the UNet decoder learns appearance and identity. Exploiting this property, we design forgetting mechanisms to tackle the fatal and inevitable overfitting issues that arise when fine-tuning Diffusion Models on a single image, thus significantly boosting the editing capability of Diffusion Models. Our method, Forgedit, built on Stable Diffusion, achieves new state-of-the-art results on the challenging text-guided image editing benchmark TEdBench, surpassing previous SOTA methods such as Imagic with Imagen in terms of both CLIP score and LPIPS score. Codes are available at https://github.com/witcherofresearch/Forgedit
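The abstract describes the vector projection mechanism only at a high level. As a rough illustration of the general idea (decomposing a fine-tuned, source-reconstructing text embedding against the target-prompt embedding and re-weighting the two parts separately), here is a minimal PyTorch sketch; the function name, the scale parameters, and the exact recombination rule are assumptions for illustration, not the paper's formulation.

```python
import torch

def project_and_combine(e_learned: torch.Tensor,
                        e_target: torch.Tensor,
                        identity_scale: float = 1.0,
                        edit_scale: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of a projection-based embedding combination.

    Splits the learned (source-reconstructing) text embedding into a component
    parallel to the target-prompt embedding and an orthogonal remainder, then
    recombines them with independent weights.
    """
    # Flatten token/embedding dims so the projection is a single inner product.
    flat_learned = e_learned.flatten()
    flat_target = e_target.flatten()

    # Component of the learned embedding along the target-prompt direction.
    coeff = torch.dot(flat_learned, flat_target) / torch.dot(flat_target, flat_target)
    parallel = coeff * e_target

    # Orthogonal remainder: information present in the learned embedding but
    # not expressed by the target prompt (source-specific identity details).
    orthogonal = e_learned - parallel

    # Re-weight independently: edit_scale pushes toward the target prompt,
    # identity_scale keeps source-specific appearance. (Assumed combination.)
    return edit_scale * e_target + identity_scale * orthogonal
```

Likewise, the forgetting mechanism is only named in the abstract. One plausible reading, shown purely as an assumption-laden sketch, is to restore a subset of the fine-tuned UNet's parameters (for example, the encoder half or the decoder half) from the original pretrained checkpoint, so that overfitted weights are "forgotten" before editing; which half to forget, and the diffusers-style parameter prefixes used below, are assumptions.

```python
def forget_parameters(finetuned_unet: torch.nn.Module,
                      pretrained_unet: torch.nn.Module,
                      forget_prefixes: tuple = ("up_blocks",)) -> None:
    """Overwrite selected fine-tuned UNet parameters with pretrained values.

    Hypothetical: whether to forget encoder blocks ("down_blocks") or decoder
    blocks ("up_blocks") would depend on the type of edit being performed.
    """
    pretrained_state = pretrained_unet.state_dict()
    with torch.no_grad():
        for name, param in finetuned_unet.named_parameters():
            if name.startswith(forget_prefixes):
                param.copy_(pretrained_state[name])
```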
Related papers
- TurboEdit: Instant text-based image editing [32.06820085957286]
We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models.
We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image.
Our approach facilitates realistic text-guided image edits in real-time, requiring only 8 function evaluations (NFEs) for inversion and 4 NFEs per edit.
arXiv Detail & Related papers (2024-08-14T18:02:24Z) - The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing [3.58736715327935]
We introduce StyleFeatureEditor, a novel method that enables editing in both w-latents and F-latents.
We also present a new training pipeline specifically designed to train our model to accurately edit F-latents.
Our method is compared with state-of-the-art encoding approaches, demonstrating that our model excels in terms of reconstruction quality.
arXiv Detail & Related papers (2024-06-15T11:28:32Z) - Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion [61.42732844499658]
This paper systematically improves the text-guided image editing techniques based on diffusion models.
We incorporate human annotation as external knowledge to confine editing within a "Mask-informed" region.
arXiv Detail & Related papers (2024-05-24T07:53:59Z) - DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing.
Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
arXiv Detail & Related papers (2024-02-04T18:50:29Z) - ZONE: Zero-Shot Instruction-Guided Local Editing [56.56213730578504]
We propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE.
We first convert the editing intent from the user-provided instruction into specific image editing regions through InstructPix2Pix.
We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segment model.
arXiv Detail & Related papers (2023-12-28T02:54:34Z) - InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing [27.661609140918916]
InFusion is a framework for zero-shot text-based video editing.
It supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt.
Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training.
arXiv Detail & Related papers (2023-07-22T17:05:47Z) - DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models [66.43179841884098]
We propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models.
Our method achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging.
arXiv Detail & Related papers (2023-07-05T16:43:56Z) - Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models [6.34777393532937]
We propose an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing.
Our proposed editing method consists of a reconstruction stage and an editing stage.
Experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-08T03:34:33Z) - Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z) - UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image [2.999198565272416]
We make the observation that image-generation models can be converted to image-editing models simply by fine-tuning them on a single image.
We propose UniTune, a novel image editing method. UniTune takes an arbitrary image and a textual edit description as input, and carries out the edit while maintaining high fidelity to the input image.
We demonstrate that it is broadly applicable and can perform a surprisingly wide range of expressive editing operations, including those requiring significant visual changes that were previously impossible.
arXiv Detail & Related papers (2022-10-17T23:46:05Z) - EditGAN: High-Precision Semantic Image Editing [120.49401527771067]
EditGAN is a novel method for high-quality, high-precision semantic image editing.
We show that EditGAN can manipulate images with an unprecedented level of detail and freedom.
We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data.
arXiv Detail & Related papers (2021-11-04T22:36:33Z)