Imagic: Text-Based Real Image Editing with Diffusion Models
- URL: http://arxiv.org/abs/2210.09276v3
- Date: Mon, 20 Mar 2023 15:58:50 GMT
- Title: Imagic: Text-Based Real Image Editing with Diffusion Models
- Authors: Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali
Dekel, Inbar Mosseri, Michal Irani
- Abstract summary: We demonstrate the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image.
Our proposed method requires only a single input image and a target text.
It operates on real images, and does not require any additional inputs.
- Score: 19.05825157237432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-conditioned image editing has recently attracted considerable interest.
However, most methods are currently either limited to specific editing types
(e.g., object overlay, style transfer), or apply to synthetically generated
images, or require multiple input images of a common object. In this paper we
demonstrate, for the very first time, the ability to apply complex (e.g.,
non-rigid) text-guided semantic edits to a single real image. For example, we
can change the posture and composition of one or multiple objects inside an
image, while preserving its original characteristics. Our method can make a
standing dog sit down or jump, cause a bird to spread its wings, etc. -- each
within its single high-resolution natural image provided by the user. Contrary
to previous work, our proposed method requires only a single input image and a
target text (the desired edit). It operates on real images, and does not
require any additional inputs (such as image masks or additional views of the
object). Our method, which we call "Imagic", leverages a pre-trained
text-to-image diffusion model for this task. It produces a text embedding that
aligns with both the input image and the target text, while fine-tuning the
diffusion model to capture the image-specific appearance. We demonstrate the
quality and versatility of our method on numerous inputs from various domains,
showcasing a plethora of high quality complex semantic image edits, all within
a single unified framework.
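Based only on the description in the abstract, the following is a minimal PyTorch sketch of the two-part procedure it names: first optimize a text embedding so that it aligns with the input image, then fine-tune the diffusion model around that embedding to capture image-specific appearance. The toy denoiser, noise schedule, tensor shapes, step counts, and optimizer settings are placeholder assumptions, not the paper's actual implementation.
```python
# Minimal sketch of the two-stage idea described in the abstract (not the authors' code).
# The toy denoiser, shapes, schedule, and hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a pre-trained text-conditioned diffusion denoiser eps_theta(x_t, t, e)."""
    def __init__(self, img_dim=64, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + emb_dim + 1, 128), nn.SiLU(),
            nn.Linear(128, img_dim),
        )

    def forward(self, x_t, t, e):
        return self.net(torch.cat([x_t, e, t], dim=-1))

def denoising_loss(model, x0, e, num_steps=1000):
    """Standard diffusion objective: predict the noise added to x0 at a random timestep."""
    t = torch.randint(1, num_steps, (x0.shape[0], 1)).float() / num_steps
    noise = torch.randn_like(x0)
    alpha = 1.0 - t                              # toy linear schedule (assumption)
    x_t = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
    return ((model(x_t, t, e) - noise) ** 2).mean()

model = ToyDenoiser()                            # stands in for a pre-trained model
x0 = torch.randn(1, 64)                          # stands in for the encoded input image
e_tgt = torch.randn(1, 32)                       # stands in for the target-text embedding

# Stage A: optimize the text embedding (weights frozen) so it reconstructs the input image.
for p in model.parameters():
    p.requires_grad_(False)
e_opt = e_tgt.clone().requires_grad_(True)
opt_e = torch.optim.Adam([e_opt], lr=1e-3)
for _ in range(100):
    opt_e.zero_grad()
    denoising_loss(model, x0, e_opt).backward()
    opt_e.step()

# Stage B: fine-tune the diffusion model (embedding frozen) to capture image-specific appearance.
for p in model.parameters():
    p.requires_grad_(True)
opt_m = torch.optim.Adam(model.parameters(), lr=1e-5)
for _ in range(100):
    opt_m.zero_grad()
    denoising_loss(model, x0, e_opt.detach()).backward()
    opt_m.step()
```
In practice both stages would run on a real pre-trained text-to-image diffusion model; the sketch only shows how the two stages target different variables (the embedding, then the weights) under the same denoising objective.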
Related papers
- Editable Image Elements for Controllable Synthesis [79.58148778509769]
We propose an image representation that promotes spatial editing of input images using a diffusion model.
We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition.
arXiv Detail & Related papers (2024-04-24T17:59:11Z)
- Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models [6.34777393532937]
We propose an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing.
Our proposed editing method consists of a reconstruction stage and an editing stage.
Experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-08T03:34:33Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process (a minimal sketch of this idea appears after this list).
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models [0.0]
We propose an optimization-free and zero fine-tuning framework that applies complex and non-rigid edits to a single real image via a text prompt.
We prove our method's efficacy in producing high-quality, diverse, semantically coherent, and faithful real image edits.
arXiv Detail & Related papers (2022-11-15T01:07:38Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
- Text2LIVE: Text-Driven Layered Image and Video Editing [13.134513605107808]
We present a method for zero-shot, text-driven appearance manipulation in natural images and videos.
Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects.
We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes.
arXiv Detail & Related papers (2022-04-05T21:17:34Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms (a minimal sketch of this loop appears after this list).
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
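The Zero-shot Image-to-Image Translation (pix2pix-zero) entry above describes cross-attention guidance: keeping the cross-attention maps of the edited sample close to those of the input image throughout the diffusion process. The following is a minimal PyTorch sketch of that idea with toy components; the attention module, token shapes, and guidance weight are assumptions, not the authors' implementation.
```python
# Minimal sketch of cross-attention guidance as summarized in the pix2pix-zero entry above
# (not the authors' code); the toy attention module, shapes, and guidance weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCrossAttention(nn.Module):
    """Stand-in for one cross-attention layer of a text-to-image diffusion model."""
    def __init__(self, img_dim=64, txt_dim=32, head_dim=48):
        super().__init__()
        self.q = nn.Linear(img_dim, head_dim)
        self.k = nn.Linear(txt_dim, head_dim)

    def forward(self, x_tokens, text_tokens):
        # Attention map A = softmax(Q K^T / sqrt(d)) between image tokens and text tokens.
        q, k = self.q(x_tokens), self.k(text_tokens)
        return F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)

attn = ToyCrossAttention()
x_tokens_ref = torch.randn(16, 64)       # stands in for image tokens on the reconstruction path
text_ref = torch.randn(8, 32)            # stands in for the source-prompt embedding
text_edit = torch.randn(8, 32)           # stands in for the edit-prompt embedding

# Reference maps recorded while reconstructing the original image (no gradient needed).
with torch.no_grad():
    ref_maps = attn(x_tokens_ref, text_ref)

# During editing, nudge the noisy image tokens so their cross-attention maps stay near the reference.
x_tokens_edit = x_tokens_ref.clone().requires_grad_(True)
guidance_weight = 1.0                    # assumption: strength of the guidance term
loss = F.mse_loss(attn(x_tokens_edit, text_edit), ref_maps)
grad, = torch.autograd.grad(loss, x_tokens_edit)
x_tokens_edit = (x_tokens_edit - guidance_weight * grad).detach()
# In a full sampler this correction would be applied at every denoising step before the model update.
```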
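The FlexIT entry above describes combining the input image and the text instruction into a single target point in the CLIP multimodal embedding space and iteratively transforming the image toward it under regularization. Below is a minimal PyTorch sketch of that loop using toy stand-ins; the encoder, the equal weighting of image and text embeddings, the regularizer, and all hyperparameters are assumptions, and the sketch optimizes the image tensor directly for simplicity rather than a learned latent.
```python
# Minimal sketch of the FlexIT-style loop summarized above (not the authors' code);
# the toy encoder, loss weights, and step counts are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    """Stand-in for a CLIP-like image encoder mapping images into the shared embedding space."""
    def __init__(self, img_dim=256, emb_dim=64):
        super().__init__()
        self.proj = nn.Linear(img_dim, emb_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

img_encoder = ToyImageEncoder()
for p in img_encoder.parameters():
    p.requires_grad_(False)                              # the encoder stays fixed

image = torch.randn(1, 256)                              # stands in for the (flattened) input image
text_emb = F.normalize(torch.randn(1, 64), dim=-1)       # stands in for the instruction's text embedding

# Combine image and text embeddings into a single target point in the shared space.
with torch.no_grad():
    target = F.normalize(img_encoder(image) + text_emb, dim=-1)   # equal weighting is an assumption

# Iteratively transform the image toward the target while regularizing against drifting too far.
edited = image.clone().requires_grad_(True)
opt = torch.optim.Adam([edited], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    sim_loss = 1 - F.cosine_similarity(img_encoder(edited), target).mean()
    reg_loss = F.mse_loss(edited, image)                 # simple stand-in for the paper's regularizers
    (sim_loss + 0.1 * reg_loss).backward()
    opt.step()
```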