Text2LIVE: Text-Driven Layered Image and Video Editing
- URL: http://arxiv.org/abs/2204.02491v1
- Date: Tue, 5 Apr 2022 21:17:34 GMT
- Title: Text2LIVE: Text-Driven Layered Image and Video Editing
- Authors: Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, Tali Dekel
- Abstract summary: We present a method for zero-shot, text-driven appearance manipulation in natural images and videos.
Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects.
We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes.
- Score: 13.134513605107808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method for zero-shot, text-driven appearance manipulation in
natural images and videos. Given an input image or video and a target text
prompt, our goal is to edit the appearance of existing objects (e.g., object's
texture) or augment the scene with visual effects (e.g., smoke, fire) in a
semantically meaningful manner. We train a generator using an internal dataset
of training examples, extracted from a single input (image or video and target
text prompt), while leveraging an external pre-trained CLIP model to establish
our losses. Rather than directly generating the edited output, our key idea is
to generate an edit layer (color+opacity) that is composited over the original
input. This allows us to constrain the generation process and maintain high
fidelity to the original input via novel text-driven losses that are applied
directly to the edit layer. Our method neither relies on a pre-trained
generator nor requires user-provided edit masks. We demonstrate localized,
semantic edits on high-resolution natural images and videos across a variety of
objects and scenes.
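The layered formulation described above (an RGBA edit layer composited over the source, scored against the target text with CLIP) can be illustrated in a few lines. The following is a minimal sketch only, assuming PyTorch and the OpenAI `clip` package; the generator that produces the edit layer, the internal training set, and the paper's additional text-driven losses applied directly to the edit layer are omitted, and all function and variable names here are hypothetical.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pre-trained CLIP model; cast to float32 to keep dtypes uniform on GPU.
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()

# Normalization statistics CLIP expects at its input.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)


def composite(source_rgb, edit_rgba):
    """Alpha-composite a color+opacity edit layer over the source image.

    source_rgb: [B, 3, H, W] in [0, 1]; edit_rgba: [B, 4, H, W] in [0, 1].
    """
    color, alpha = edit_rgba[:, :3], edit_rgba[:, 3:4]
    return alpha * color + (1.0 - alpha) * source_rgb


def clip_text_loss(image, prompt):
    """Negative cosine similarity between the composited image and the target text."""
    image = F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)
    image = (image - CLIP_MEAN) / CLIP_STD
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(clip.tokenize([prompt]).to(device))
    return 1.0 - F.cosine_similarity(img_emb, txt_emb).mean()


# Illustration only: in the paper the edit layer comes from a generator trained on
# an internal dataset extracted from the single input; here it is a random tensor.
source = torch.rand(1, 3, 512, 512, device=device)
edit_rgba = torch.rand(1, 4, 512, 512, device=device)
loss = clip_text_loss(composite(source, edit_rgba), "fire coming out of the chimney")
print(float(loss))
```

Because the loss is applied to the composited result (and, in the paper, to the edit layer itself), the optimization is constrained to produce localized changes while the unedited regions of the source remain intact.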
Related papers
- TexSliders: Diffusion-Based Texture Editing in CLIP Space [17.449209402077276]
We analyze existing editing methods and show that they are not directly applicable to textures.
We propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation.
arXiv Detail & Related papers (2024-05-01T17:57:21Z)
- MagicStick: Controllable Video Editing via Control Handle Transformations [49.29608051543133]
MagicStick is a controllable video editing method that edits video properties by applying transformations to extracted internal control signals.
We present experiments on numerous examples within our unified framework.
We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared with previous works.
arXiv Detail & Related papers (2023-12-05T17:58:06Z)
- Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, as well as qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or user-specific masks.
Our method is the first to demonstrate zero-shot, text-driven video style and local attribute editing using a pre-trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z) - Shape-aware Text-driven Layered Video Editing [39.56765973770167]
We present a shape-aware, text-driven video editing method to handle shape changes.
We first propagate the deformation field between the input and edited keyframe to all frames.
We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions.
arXiv Detail & Related papers (2023-01-30T18:41:58Z) - Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image
Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built by fine-tuning on text-guided image inpainting.
Its edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training.
To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z) - DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z) - Prompt-to-Prompt Image Editing with Cross Attention Control [41.26939787978142]
We present an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only.
We show our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
arXiv Detail & Related papers (2022-08-02T17:55:41Z)