Inversion-Free Image Editing with Natural Language
- URL: http://arxiv.org/abs/2312.04965v1
- Date: Thu, 7 Dec 2023 18:58:27 GMT
- Title: Inversion-Free Image Editing with Natural Language
- Authors: Sihan Xu, Yidong Huang, Jiayi Pan, Ziqiao Ma, Joyce Chai
- Abstract summary: We present inversion-free editing (InfEdit), which allows for consistent and faithful editing for both rigid and non-rigid semantic changes.
InfEdit shows strong performance in various editing tasks and also maintains a seamless workflow (less than 3 seconds on one single A40), demonstrating the potential for real-time applications.
- Score: 18.373145158518135
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite recent advances in inversion-based editing, text-guided image
manipulation remains challenging for diffusion models. The primary bottlenecks
include 1) the time-consuming nature of the inversion process; 2) the struggle
to balance consistency with accuracy; 3) the lack of compatibility with
efficient consistency sampling methods used in consistency models. To address
the above issues, we start by asking ourselves if the inversion process can be
eliminated for editing. We show that when the initial sample is known, a
special variance schedule reduces the denoising step to the same form as the
multi-step consistency sampling. We name this Denoising Diffusion Consistent
Model (DDCM), and note that it implies a virtual inversion strategy without
explicit inversion in sampling. We further unify the attention control
mechanisms in a tuning-free framework for text-guided editing. Combining them,
we present inversion-free editing (InfEdit), which allows for consistent and
faithful editing for both rigid and non-rigid semantic changes, catering to
intricate modifications without compromising on the image's integrity and
explicit inversion. Through extensive experiments, InfEdit shows strong
performance in various editing tasks and also maintains a seamless workflow
(less than 3 seconds on one single A40), demonstrating the potential for
real-time applications. Project Page: https://sled-group.github.io/InfEdit/
Related papers
- TurboEdit: Instant text-based image editing [32.06820085957286]
We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models.
We introduce an encoder based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image.
Our approach facilitates realistic text-guided image edits in real-time, requiring only 8 number of functional evaluations (NFEs) in inversion and 4 NFEs per edit.
arXiv Detail & Related papers (2024-08-14T18:02:24Z) - TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks - the edit-friendly'' DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z) - DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image
Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing.
Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
arXiv Detail & Related papers (2024-02-04T18:50:29Z) - Tuning-Free Inversion-Enhanced Control for Consistent Image Editing [44.311286151669464]
We present a novel approach called Tuning-free Inversion-enhanced Control (TIC)
TIC correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction.
We also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes.
arXiv Detail & Related papers (2023-12-22T11:13:22Z) - Editing 3D Scenes via Text Prompts without Retraining [80.57814031701744]
DN2N is a text-driven editing method that allows for the direct acquisition of a NeRF model with universal editing capabilities.
Our method employs off-the-shelf text-based editing models of 2D images to modify the 3D scene images.
Our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material changing, and style transfer.
arXiv Detail & Related papers (2023-09-10T02:31:50Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named textbfRecurrent vtextbfIdeo textbfGAN textbfInversion and etextbfDiting (RIGID)
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z) - MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image
Synthesis and Editing [54.712205852602736]
We develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously.
Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency.
Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing.
arXiv Detail & Related papers (2023-04-17T17:42:19Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules tuning and on the source video (2) inverting the source video into the noise and editing with target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.