Cut-and-Paste: Subject-Driven Video Editing with Attention Control
- URL: http://arxiv.org/abs/2311.11697v1
- Date: Mon, 20 Nov 2023 12:00:06 GMT
- Title: Cut-and-Paste: Subject-Driven Video Editing with Attention Control
- Authors: Zhichao Zuo, Zhao Zhang, Yan Luo, Yang Zhao, Haijun Zhang, Yi Yang,
Meng Wang
- Abstract summary: We present a novel framework termed Cut-and-Paste for real-world semantic video editing under the guidance of a text prompt and an additional reference image.
Compared with current methods, the whole process of our method is like "cutting" the source object to be edited and then "pasting" the target object provided by the reference image.
- Score: 47.76519877672902
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents a novel framework termed Cut-and-Paste for real-world semantic video editing under the guidance of a text prompt and an additional reference image. While text-driven video editing has demonstrated a remarkable ability to generate highly diverse videos following given text prompts, fine-grained semantic edits are hard to control with a plain textual prompt alone in terms of object details and the edited region, and cumbersome long text descriptions are usually needed for the task. We therefore investigate subject-driven video editing for more precise control over the edited region, background preservation, and fine-grained semantic generation. We achieve this goal by introducing a reference image as supplementary input to text-driven video editing, which avoids the effort of devising a cumbersome text prompt that describes the detailed appearance of the object. To limit the editing area, we draw on a cross-attention control method from image editing and extend it to video editing by fusing the attention maps of adjacent frames, which strikes a balance between maintaining the video background and spatio-temporal consistency. Compared with current methods, the whole process of our method is like "cutting" the source object to be edited and then "pasting" the target object provided by the reference image. We demonstrate that our method performs favorably against prior art for video editing under the guidance of a text prompt and an extra reference image, as measured by both quantitative and subjective evaluations.
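As a rough illustration of the adjacent-frame attention fusion described in the abstract, here is a minimal PyTorch-style sketch; the function name, tensor layout, and blending weight are assumptions for illustration, not the authors' actual implementation.

```python
import torch

def fuse_adjacent_attention(attn: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend each frame's cross-attention map with its temporal neighbours.

    attn:  (frames, heads, image_tokens, text_tokens) cross-attention
           probabilities collected from the diffusion U-Net (assumed layout).
    alpha: weight kept on the current frame; the remainder is split
           between the previous and next frames.
    """
    prev_f = torch.roll(attn, shifts=1, dims=0)
    next_f = torch.roll(attn, shifts=-1, dims=0)
    # torch.roll wraps around, so let the first/last frames reuse themselves.
    prev_f[0] = attn[0]
    next_f[-1] = attn[-1]
    fused = alpha * attn + (1.0 - alpha) * 0.5 * (prev_f + next_f)
    # Renormalise over text tokens so each row remains a distribution.
    return fused / fused.sum(dim=-1, keepdim=True)
```

In this reading, the fused map both localizes the edit (via the text token for the edited object) and smooths it across time; in Prompt-to-Prompt-style control, such a map would then steer the attention during denoising.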
Related papers
- GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models [2.362412515574206]
We propose "GenVideo" for editing videos leveraging target-image aware T2I models.
Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit.
arXiv Detail & Related papers (2024-04-18T23:25:27Z)
- InstructBrush: Learning Attention-based Instruction Optimization for Image Editing [54.07526261513434]
InstructBrush is an inversion method for instruction-based image editing.
It extracts editing effects from image pairs as editing instructions, which are further applied for image editing.
Our approach achieves superior performance in editing and is more semantically consistent with the target editing effects.
arXiv Detail & Related papers (2024-03-27T15:03:38Z)
- UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [28.140945021777878]
We present UniEdit, a tuning-free framework that supports both video motion and appearance editing.
To realize motion editing while preserving source video content, we introduce auxiliary motion-reference and reconstruction branches.
The obtained features are then injected into the main editing path via temporal and spatial self-attention layers.
arXiv Detail & Related papers (2024-02-20T17:52:12Z)
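A minimal sketch of the feature-injection idea summarized above, assuming the common key/value-replacement form of attention injection; the names and tensor shapes are illustrative, not UniEdit's actual code.

```python
import torch

def inject_branch_features(q_main: torch.Tensor,
                           k_aux: torch.Tensor,
                           v_aux: torch.Tensor) -> torch.Tensor:
    """Self-attention where the main editing path queries keys/values taken
    from an auxiliary branch (e.g. a reconstruction or motion-reference
    pass), so that branch's content or motion cues flow into the edit.

    q_main:       (heads, tokens, dim) queries from the editing path.
    k_aux, v_aux: (heads, tokens, dim) keys/values from the auxiliary branch.
    """
    scale = q_main.shape[-1] ** -0.5
    weights = torch.softmax(q_main @ k_aux.transpose(-2, -1) * scale, dim=-1)
    return weights @ v_aux
```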
- TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts [119.84478647745658]
TIP-Editor is a 3D scene editing framework that accepts both text and image prompts and a 3D bounding box to specify the editing region.
Experiments have demonstrated that TIP-Editor conducts accurate editing following the text and image prompts in the specified bounding box region.
arXiv Detail & Related papers (2024-01-26T12:57:05Z)
- MagicStick: Controllable Video Editing via Control Handle Transformations [109.26314726025097]
MagicStick is a controllable video editing method that edits video properties by applying transformations to the extracted internal control signals.
We present experiments on numerous examples within our unified framework.
We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared with previous works.
arXiv Detail & Related papers (2023-12-05T17:58:06Z)
- FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces patches on the same flow path across different frames to attend to each other in the attention module.
Results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-10-09T17:59:53Z)
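A minimal sketch of the flow-guided attention constraint described above: patches lying on the same optical-flow trajectory may attend to each other, and all other cross-frame pairs are masked out. The trajectory-id representation is an assumption for illustration, not FLATTEN's actual code.

```python
import torch

def flow_path_attention_mask(path_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask from optical-flow trajectories.

    path_ids: (frames, patches) integer id of the flow trajectory each patch
              lies on, e.g. obtained by tracking patches with a pre-computed
              optical flow field (hypothetical preprocessing step).
    Returns:  (frames * patches, frames * patches) mask, True where
              attention is permitted.
    """
    flat = path_ids.reshape(-1)            # (frames * patches,)
    return flat[:, None] == flat[None, :]  # same trajectory => may attend
```

Such a mask would typically be applied inside the U-Net's attention by setting disallowed logits to -inf before the softmax.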
- Shape-aware Text-driven Layered Video Editing [39.56765973770167]
We present a shape-aware, text-driven video editing method to handle shape changes.
We first propagate the deformation field between the input and edited keyframe to all frames.
We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions.
arXiv Detail & Related papers (2023-01-30T18:41:58Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
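A minimal sketch of the contrastive mask generation described above, assuming single noise estimates; DiffEdit itself averages over several noise samples, and the threshold here is an illustrative choice rather than the paper's setting.

```python
import torch

@torch.no_grad()
def contrastive_edit_mask(noise_src: torch.Tensor,
                          noise_tgt: torch.Tensor,
                          threshold: float = 0.5) -> torch.Tensor:
    """Derive an edit mask by contrasting two denoiser predictions for the
    same noised image, conditioned on the source vs. the target prompt.

    noise_src, noise_tgt: (C, H, W) noise estimates from the diffusion model.
    Returns a binary (H, W) mask marking where the predictions disagree.
    """
    diff = (noise_src - noise_tgt).abs().mean(dim=0)  # average over channels
    diff = diff / (diff.max() + 1e-8)                 # normalise to [0, 1]
    return (diff > threshold).float()
```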
- Prompt-to-Prompt Image Editing with Cross Attention Control [41.26939787978142]
We present an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only.
We show our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
arXiv Detail & Related papers (2022-08-02T17:55:41Z)
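Since Cut-and-Paste builds on this style of cross-attention control, a minimal sketch of attention-map injection may help; the step-dependent switch and tensor shapes are illustrative assumptions, not the paper's code.

```python
import torch

def injected_cross_attention(attn_src: torch.Tensor,
                             attn_tgt: torch.Tensor,
                             v_tgt: torch.Tensor,
                             use_source: bool) -> torch.Tensor:
    """Cross-attention with map injection: during early denoising steps the
    target branch reuses the source prompt's attention maps, preserving
    spatial layout while the edited text supplies new content via the values.

    attn_src, attn_tgt: (heads, image_tokens, text_tokens) attention maps
                        from the source and target prompts (same length).
    v_tgt:              (heads, text_tokens, head_dim) target-prompt values.
    """
    attn = attn_src if use_source else attn_tgt
    return attn @ v_tgt
```

Setting use_source=True for a fixed fraction of the denoising schedule is the usual way to trade layout preservation against edit strength.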