Edit-A-Video: Single Video Editing with Object-Aware Consistency
- URL: http://arxiv.org/abs/2303.07945v4
- Date: Fri, 17 Nov 2023 12:43:46 GMT
- Title: Edit-A-Video: Single Video Editing with Object-Aware Consistency
- Authors: Chaehun Shin, Heeseung Kim, Che Hyun Lee, Sang-gil Lee, Sungroh Yoon
- Abstract summary: We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
- Score: 49.43316939996227
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although text-to-video (TTV) models have recently achieved
remarkable success, few approaches extend TTV to
video editing. Motivated by approaches that adapt TTV models from
diffusion-based text-to-image (TTI) models, we propose a video editing
framework given only a pretrained TTI model and a single <text, video> pair,
which we term Edit-A-Video. The framework consists of two stages: (1) inflating
the 2D model into a 3D model by appending temporal modules and tuning on the
source video, and (2) inverting the source video into noise and editing with
the target text prompt and attention map injection. These stages enable
temporal modeling and the preservation of semantic attributes of the source
video. One of the key challenges for video editing is the background
inconsistency problem, where regions not targeted by the edit suffer from
undesirable and inconsistent temporal alterations. To mitigate this issue, we
also introduce a novel mask blending method, termed sparse-causal blending (SC
Blending). We improve previous mask blending methods to reflect temporal
consistency so that the area where the editing is applied exhibits smooth
transition while also achieving spatio-temporal consistency of the unedited
regions. We present extensive experimental results over various types of text
and videos, and demonstrate the superiority of the proposed method compared to
baselines in terms of background consistency, text alignment, and video editing
quality.
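The abstract does not spell out how SC Blending combines masks, so the following is only a toy sketch of temporally aware mask blending under stated assumptions. It assumes "sparse-causal" means that each frame's edit mask is merged with the masks of the first and the previous frame, and that blending is a per-pixel mix of source and edited values. The names `sparse_causal_masks` and `blend` are hypothetical and do not come from the paper.

```python
from typing import List

# One H x W grid of values; a toy stand-in for a per-frame latent or mask.
Frame = List[List[float]]


def sparse_causal_masks(masks: List[Frame]) -> List[Frame]:
    """Union each frame's mask with the first and previous frames' masks.

    Assumption: "sparse-causal" is read here as "first frame + previous
    frame"; the paper's exact SC Blending rule may differ.
    """
    first = masks[0]
    merged = []
    for t, cur in enumerate(masks):
        prev = masks[max(t - 1, 0)]  # frame 0 uses itself as "previous"
        merged.append([
            [max(c, f, p) for c, f, p in zip(cur_row, first_row, prev_row)]
            for cur_row, first_row, prev_row in zip(cur, first, prev)
        ])
    return merged


def blend(source: List[Frame], edited: List[Frame], masks: List[Frame]) -> List[Frame]:
    """Keep source values where the mask is 0 and edited values where it is 1."""
    return [
        [
            [m * e + (1.0 - m) * s for s, e, m in zip(s_row, e_row, m_row)]
            for s_row, e_row, m_row in zip(src, edt, msk)
        ]
        for src, edt, msk in zip(source, edited, masks)
    ]


# Toy example: three 1 x 3 frames whose edited region moves one pixel per frame.
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
source = [[[0.0, 0.0, 0.0]] for _ in range(3)]
edited = [[[1.0, 1.0, 1.0]] for _ in range(3)]
result = blend(source, edited, sparse_causal_masks(masks))
```

In this toy setup the merged masks change more gradually from frame to frame than the raw per-frame masks, while pixels outside every merged mask keep their source values unchanged.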
Related papers
- VideoDirector: Precise Video Editing via Text-to-Video Models [45.53826541639349]
Current video editing methods rely on text-to-image (T2I) models, which inherently lack temporal-coherence generative ability.
We propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion.
Experimental results demonstrate that our method effectively harnesses the powerful temporal generation capabilities of T2V models.
arXiv Detail & Related papers (2024-11-26T16:56:53Z) - COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z) - Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion [61.42732844499658]
This paper systematically improves the text-guided image editing techniques based on diffusion models.
We incorporate human annotation as external knowledge to confine editing within a "Mask-informed" region.
arXiv Detail & Related papers (2024-05-24T07:53:59Z) - Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices [19.07572422897737]
We present Slicedit, a method for text-based video editing that utilizes a pretrained T2I diffusion model to process both spatial and temporal slices.
Our method generates videos that retain the structure and motion of the original video while adhering to the target text.
arXiv Detail & Related papers (2024-05-20T17:55:56Z) - FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module.
Results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance.
arXiv Detail & Related papers (2023-10-09T17:59:53Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z) - InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing [27.661609140918916]
InFusion is a framework for zero-shot text-based video editing.
It supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt.
Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training.
arXiv Detail & Related papers (2023-07-22T17:05:47Z) - FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z) - Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding [35.18070525015657]
We propose a novel face video editing framework based on diffusion autoencoders.
Our model is based on diffusion models and can satisfy both reconstruction and edit capabilities at the same time.
arXiv Detail & Related papers (2022-12-06T07:41:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.