Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models
- URL: http://arxiv.org/abs/2305.04441v1
- Date: Mon, 8 May 2023 03:34:33 GMT
- Title: Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models
- Authors: Wenkai Dong, Song Xue, Xiaoyue Duan, Shumin Han
- Abstract summary: We propose an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing.
Our proposed editing method consists of a reconstruction stage and an editing stage.
Experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines.
- Score: 6.34777393532937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large-scale language-image models (e.g., text-guided diffusion
models) have considerably improved image generation, producing photorealistic
images in various domains. Building on this success, current image editing
methods use text to achieve intuitive and versatile modification of images.
To edit a real image using diffusion models, one must
first invert the image to a noisy latent from which an edited image is sampled
with a target text prompt. However, most methods lack one of the following:
user-friendliness (e.g., additional masks or precise descriptions of the input
image are required), generalization to larger domains, or high fidelity to the
input image. In this paper, we design an accurate and quick inversion
technique, Prompt Tuning Inversion, for text-driven image editing.
Specifically, our proposed editing method consists of a reconstruction stage
and an editing stage. In the first stage, we encode the information of the
input image into a learnable conditional embedding via Prompt Tuning Inversion.
In the second stage, we apply classifier-free guidance to sample the edited
image, where the conditional embedding is calculated by linearly interpolating
between the target embedding and the optimized one obtained in the first stage.
This technique ensures that our method achieves a superior trade-off between
editability and fidelity to the input image. For example, we can change the color
of a specific object while preserving its original shape and background under
the guidance of only a target text prompt. Extensive experiments on ImageNet
demonstrate the superior editing performance of our method compared to the
state-of-the-art baselines.
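To make the two-stage procedure concrete, the following is a minimal, hypothetical Python sketch of the editing stage: classifier-free guidance is applied with a conditional embedding obtained by linearly interpolating between the target text embedding and the embedding optimized during Prompt Tuning Inversion. The `unet`, `scheduler`, and embedding tensors are placeholders standing in for a latent diffusion pipeline, not the authors' implementation.

```python
import torch

def edit_with_interpolated_embedding(
    unet,                  # placeholder noise predictor: unet(x_t, t, cond) -> predicted noise
    scheduler,             # placeholder DDIM-style scheduler: .timesteps and .step(eps, t, x_t) -> next latent
    x_T,                   # noisy latent obtained by inverting the input image
    c_opt,                 # conditional embedding optimized in the reconstruction stage
    c_target,              # embedding of the target text prompt
    c_uncond,              # embedding of the empty prompt, for classifier-free guidance
    eta=0.7,               # interpolation ratio: 1.0 -> pure target prompt, 0.0 -> pure reconstruction
    guidance_scale=7.5,    # classifier-free guidance strength
):
    # Linearly interpolate between the target embedding and the optimized one
    cond = eta * c_target + (1.0 - eta) * c_opt

    x_t = x_T
    for t in scheduler.timesteps:
        with torch.no_grad():
            eps_uncond = unet(x_t, t, c_uncond)
            eps_cond = unet(x_t, t, cond)
        # Classifier-free guidance: push the prediction toward the conditional branch
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        x_t = scheduler.step(eps, t, x_t)
    return x_t  # edited latent, to be decoded back into an image
```

Larger values of the interpolation ratio push the sample toward the target prompt (more editability), while smaller values keep it closer to the reconstruction of the input image (higher fidelity).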
Related papers
- TurboEdit: Instant text-based image editing [32.06820085957286]
We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models.
We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and on the reconstruction from the previous step, allowing it to correct the next reconstruction toward the input image (a rough sketch of this loop appears after the related-papers list).
Our approach facilitates realistic text-guided image edits in real time, requiring only 8 function evaluations (NFEs) for inversion and 4 NFEs per edit.
arXiv Detail & Related papers (2024-08-14T18:02:24Z) - TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z) - Tuning-Free Image Customization with Image and Text Guidance [65.9504243633169]
We introduce a tuning-free framework for simultaneous text-image-guided image customization.
Our approach preserves the semantic features of the reference image subject while allowing modification of detailed attributes based on text descriptions.
Our approach outperforms previous methods in both human and quantitative evaluations.
arXiv Detail & Related papers (2024-03-19T11:48:35Z) - StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [86.92711729969488]
We exploit the capabilities of pretrained diffusion models for image editing.
Existing approaches either finetune the model or invert the image in the latent space of the pretrained model.
They suffer from two problems: unsatisfying results in selected regions, and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z) - Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z) - Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built by fine-tuning for text-guided image inpainting.
Its edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training.
To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z) - Null-text Inversion for Editing Real Images using Guided Diffusion Models [44.27570654402436]
We introduce an accurate inversion technique and thus facilitate an intuitive text-based modification of the image.
Our Null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and prompt editing.
arXiv Detail & Related papers (2022-11-17T18:58:14Z) - Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models [0.0]
We propose an optimization-free and zero fine-tuning framework that applies complex and non-rigid edits to a single real image via a text prompt.
We prove our method's efficacy in producing high-quality, diverse, semantically coherent, and faithful real image edits.
arXiv Detail & Related papers (2022-11-15T01:07:38Z) - DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z) - UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image [2.999198565272416]
We make the observation that image-generation models can be converted to image-editing models simply by fine-tuning them on a single image.
We propose UniTune, a novel image editing method that takes an arbitrary image and a textual edit description as input and carries out the edit while maintaining high fidelity to the input image.
We demonstrate that it is broadly applicable and can perform a surprisingly wide range of expressive editing operations, including those requiring significant visual changes that were previously impossible.
arXiv Detail & Related papers (2022-10-17T23:46:05Z)
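As a rough illustration of the encoder-based iterative inversion described in the TurboEdit (instant editing) entry above, here is a minimal, hypothetical Python sketch; `encoder` and `decoder` are placeholder networks, not that paper's actual architecture.

```python
import torch

def iterative_encoder_inversion(encoder, decoder, image, num_steps=8):
    """Hypothetical sketch of encoder-based iterative inversion.

    `encoder` maps (input image, previous reconstruction) -> latent estimate,
    and `decoder` maps a latent back to an image; both are placeholders.
    """
    recon = torch.zeros_like(image)   # start from an empty reconstruction
    latent = None
    for _ in range(num_steps):
        # Condition the inversion network on the input and the previous reconstruction
        latent = encoder(image, recon)
        # Decode the current latent; the result corrects the next iteration
        recon = decoder(latent)
    return latent                     # latent whose decoding approximates `image`
```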