Shape-aware Text-driven Layered Video Editing
- URL: http://arxiv.org/abs/2301.13173v1
- Date: Mon, 30 Jan 2023 18:41:58 GMT
- Title: Shape-aware Text-driven Layered Video Editing
- Authors: Yao-Chih Lee, Ji-Ze Genevieve Jang, Yi-Ting Chen, Elizabeth Qiu,
Jia-Bin Huang
- Abstract summary: We present a shape-aware, text-driven video editing method to handle shape changes.
We first propagate the deformation field between the input and edited keyframes to all frames.
We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions.
- Score: 39.56765973770167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal consistency is essential for video editing applications. Existing
work on layered representation of videos allows propagating edits consistently
to each frame. These methods, however, can only edit object appearance rather
than object shape due to the limitation of using a fixed UV mapping field for
the texture atlas. We present a shape-aware, text-driven video editing
method to tackle this challenge. To handle shape changes in video editing, we
first propagate the deformation field between the input and edited keyframe to
all frames. We then leverage a pre-trained text-conditioned diffusion model as
guidance for refining shape distortion and completing unseen regions. The
experimental results demonstrate that our method can achieve shape-aware
consistent video editing and compare favorably with the state-of-the-art.
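A minimal code sketch may help make the pipeline concrete. The snippet below is my own illustration, not the authors' implementation: it assumes per-frame UV maps into a shared texture atlas, an appearance-edited atlas, and a keyframe shape deformation already lifted to atlas space (the helper names `warp` and `propagate_edit` are hypothetical), and it stops before the diffusion-guided refinement described in the abstract.

```python
# Minimal sketch (not the authors' code): propagate a keyframe shape deformation
# to every frame through a shared, layered atlas representation.
# `uv_maps`, `edited_atlas`, and `key_deform_atlas` are assumed to be precomputed.
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Backward-warp `image` (B,C,H,W) by a dense displacement `flow` (B,2,H,W) given in pixels."""
    B, _, H, W = image.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(image).unsqueeze(0)   # (1,2,H,W)
    coords = base + flow
    coords = torch.stack((2 * coords[:, 0] / (W - 1) - 1,                # normalize x to [-1,1]
                          2 * coords[:, 1] / (H - 1) - 1), dim=1)        # normalize y to [-1,1]
    return F.grid_sample(image, coords.permute(0, 2, 3, 1), align_corners=True)

def propagate_edit(uv_maps, edited_atlas, key_deform_atlas):
    """uv_maps: list of (1,H,W,2) frame-to-atlas coordinates in [-1,1].
    edited_atlas: (1,3,Ha,Wa) appearance-edited texture atlas.
    key_deform_atlas: (1,2,Ha,Wa) keyframe shape deformation lifted to atlas space
    (stored in frame-pixel units for simplicity)."""
    edited_frames = []
    for uv in uv_maps:
        rgb = F.grid_sample(edited_atlas, uv, align_corners=True)        # propagate appearance edit
        d = F.grid_sample(key_deform_atlas, uv, align_corners=True)      # propagate shape deformation
        edited_frames.append(warp(rgb, d))                               # realize the shape change per frame
    return edited_frames
```

In the paper, a pre-trained text-conditioned diffusion model then refines these warped frames, correcting shape distortion and completing regions that were never visible in the keyframe; the sketch stops before that stage.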
Related papers
- UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [28.140945021777878]
We present UniEdit, a tuning-free framework that supports both video motion and appearance editing.
To realize motion editing while preserving source video content, we introduce auxiliary motion-reference and reconstruction branches.
The obtained features are then injected into the main editing path via temporal and spatial self-attention layers.
arXiv Detail & Related papers (2024-02-20T17:52:12Z) - DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing [27.014978053413788]
We present a diffusion-based video editing framework, DiffusionAtlas, which can achieve both frame consistency and high fidelity in object appearance.
Our method leverages a visual-temporal diffusion model to edit objects directly on the diffusion atlases, ensuring coherent object identity across frames.
arXiv Detail & Related papers (2023-12-05T23:40:30Z) - MagicStick: Controllable Video Editing via Control Handle
Transformations [109.26314726025097]
MagicStick is a controllable video editing method that edits video properties by applying transformations to the extracted internal control signals.
We present experiments on numerous examples within our unified framework.
We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared to previous works.
arXiv Detail & Related papers (2023-12-05T17:58:06Z) - FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video
editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces patches on the same flow path across different frames to attend to each other in the attention module (a rough flow-path masking sketch appears after this list).
Results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance.
arXiv Detail & Related papers (2023-10-09T17:59:53Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image
Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z) - MagicProp: Diffusion-based Video Editing via Motion-aware Appearance
Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation.
In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame.
In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
arXiv Detail & Related papers (2023-09-02T11:13:29Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained text-to-image (TTI) model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection (a rough sketch of the inflation step appears after this list).
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
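As referenced in the FLATTEN entry above, here is a rough sketch of the flow-path masking idea. This is my reading of the concept, not the official FLATTEN code: it assumes patch-level forward flows are precomputed, labels trajectories starting from frame 0 only, and resolves collisions arbitrarily; the function name `flow_path_attention_mask` is hypothetical.

```python
# Rough sketch (my reading of the idea, not FLATTEN's official code): build a
# boolean mask so tokens attend only to tokens on the same flow trajectory.
import torch

def flow_path_attention_mask(flows):
    """flows: (T-1, 2, h, w) forward flow between consecutive frames, in patch units.
    Returns a (T*h*w, T*h*w) boolean mask (True = attention allowed)."""
    Tm1, _, h, w = flows.shape
    T = Tm1 + 1
    ids = torch.full((T, h, w), -1, dtype=torch.long)       # trajectory label per patch; -1 = unlabeled
    ids[0] = torch.arange(h * w).reshape(h, w)               # every frame-0 patch starts its own trajectory
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cur_y, cur_x = ys.float(), xs.float()
    for t in range(Tm1):
        iy = cur_y.round().long().clamp(0, h - 1)
        ix = cur_x.round().long().clamp(0, w - 1)
        cur_y = cur_y + flows[t, 1, iy, ix]                  # follow the flow into the next frame
        cur_x = cur_x + flows[t, 0, iy, ix]
        ny = cur_y.round().long().clamp(0, h - 1)
        nx = cur_x.round().long().clamp(0, w - 1)
        ids[t + 1, ny, nx] = ids[0]                          # label landing patches (collisions: last one wins)
    lab = ids.reshape(-1, 1)                                 # (T*h*w, 1)
    mask = (lab == lab.t()) & (lab >= 0) & (lab.t() >= 0)    # same trajectory and both labeled
    mask |= torch.eye(T * h * w, dtype=torch.bool)           # always allow self-attention
    return mask
```

A mask like this could be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention` (True meaning the token pair may attend); FLATTEN itself integrates flow guidance inside the diffusion U-Net's attention layers, so the real mechanism differs in detail from this toy labelling.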
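Similarly, for the Edit-A-Video entry, the 2D-to-3D inflation step can be sketched generically. The classes below (`TemporalAttention`, `InflatedBlock`) are my own illustration under simplifying assumptions (a shape-preserving 2D block, temporal self-attention as the appended module); the actual temporal modules, initialization, and tuning procedure in the paper may differ.

```python
# Assumption-heavy sketch (not Edit-A-Video's code): "inflate" a pretrained 2D
# block into a pseudo-3D block by appending a temporal self-attention module
# that mixes information across frames at every spatial location.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis only, applied per spatial location."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                                    # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        h = self.norm(tokens)
        h, _ = self.attn(h, h, h)                            # frames attend to frames
        tokens = tokens + h                                  # residual; often initialized near identity
        return tokens.reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)

class InflatedBlock(nn.Module):
    """Run the frozen 2D block frame-by-frame, then mix frames temporally."""
    def __init__(self, spatial_block, channels):
        super().__init__()
        self.spatial = spatial_block                         # pretrained 2D layer (assumed shape-preserving)
        self.temporal = TemporalAttention(channels)          # new module, tuned on the source video

    def forward(self, x):                                    # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        y = self.spatial(x.reshape(B * T, C, H, W)).reshape(B, T, C, H, W)
        return self.temporal(y)

# Example: inflate a shape-preserving conv and run it on an 8-frame clip.
block = InflatedBlock(nn.Conv2d(64, 64, 3, padding=1), channels=64)
out = block(torch.randn(1, 8, 64, 32, 32))                   # -> (1, 8, 64, 32, 32)
```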
This list is automatically generated from the titles and abstracts of the papers in this site.